# Sentiment Classification & How to "Frame Problems" for a Neural Network

## Lesson: Curate a Dataset

* Neural networks by themselves cannot do anything.
* All a neural network does is search for correlations between two datasets.

In [1]:
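def pretty_print_review_and_label(i):
    """Print the label and the first 80 characters of review i."""
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# reviews.txt holds the inputs: what we know!
# rstrip('\n') removes only the line break, so a final line without one is not truncated.
with open('reviews.txt', 'r') as f:
    reviews = [line.rstrip('\n') for line in f]

# labels.txt holds the targets: what we WANT to know!
with open('labels.txt', 'r') as f:
    labels = [line.rstrip('\n').upper() for line in f]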

In [2]:
pretty_print_review_and_label(10)

POSITIVE	:	this isn  t the comedic robin williams  nor is it the quirky  insane robin willi...


**Note:** The data in `reviews.txt` we're using has already been preprocessed a bit and contains only lower case characters. If we were working from raw data, where we didn't know it was all lower case, we would want to add a step here to convert it. That's so we treat different variations of the same word, like `The`, `the`, and `THE`, all the same way.
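
As a minimal sketch of that conversion step, assuming raw text with mixed case: the `raw_reviews` list below is made-up illustration data, not part of `reviews.txt`.

```python
# Hypothetical raw reviews with inconsistent capitalization (illustration only).
raw_reviews = ["The movie was GREAT", "the Movie was great"]

# Lowercase every review so `The`, `the`, and `THE` map to the same token.
normalized = [review.lower() for review in raw_reviews]

print(normalized)
```

After this step both toy reviews are identical strings, so any word counts built from them would no longer be split across capitalization variants.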

In [3]:
len(reviews)

25000

In [4]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [5]:
labels[0]

'POSITIVE'

## Lesson: Develop a Predictive Theory

In [6]:
print('labels.txt \t : \t reviews.txt\n')
records_to_print = [2137, 12816, 6267, 21934, 5297, 4998]
for record in records_to_print:
    pretty_print_review_and_label(record)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...


## Project 1: Quick Theory Validation

Do counts of different words predict the label? For example, are `excellent` and `fascinating` more prevalent in reviews with positive labels? Conversely, are words like `terrible` and `trash` more common in reviews with negative labels?

In [18]:
# Initial investigation: for each candidate word, count the reviews that contain it,
# split by label. Note that `in` is a substring test, so a review containing
# "trashy" also counts toward "trash".

sentiment_words = {"excellent": {"positive": 0, "negative": 0},
                   "fascinating": {"positive": 0, "negative": 0},
                   "terrible": {"positive": 0, "negative": 0},
                   "trash": {"positive": 0, "negative": 0}}

for review, label in zip(reviews, labels):
    for sentiment_word in sentiment_words:
        if sentiment_word in review:
            if label == "POSITIVE":
                sentiment_words[sentiment_word]["positive"] += 1
            else:
                sentiment_words[sentiment_word]["negative"] += 1

sentiment_words

{'excellent': {'positive': 1476, 'negative': 352},
 'fascinating': {'positive': 270, 'negative': 98},
 'terrible': {'positive': 217, 'negative': 1122},
 'trash': {'positive': 160, 'negative': 414}}
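
One way to read these numbers is as a positive-to-negative ratio per word. The sketch below copies the counts from the output above into a local dictionary named `counts` (a name introduced here for illustration); keep in mind these are counts of reviews containing each string, not counts of individual word occurrences.

```python
# Counts copied from the output of the investigation cell above.
counts = {
    'excellent': {'positive': 1476, 'negative': 352},
    'fascinating': {'positive': 270, 'negative': 98},
    'terrible': {'positive': 217, 'negative': 1122},
    'trash': {'positive': 160, 'negative': 414},
}

# A ratio above 1 means the word shows up in more positive than negative reviews.
ratios = {word: c['positive'] / c['negative'] for word, c in counts.items()}

for word, ratio in ratios.items():
    print(f"{word}: {ratio:.2f}")
```

For these four words the ratios fall on the expected side of 1, which is consistent with the theory that word counts carry signal about the label.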

Here are some hints and starter code for counting words more systematically.

In [19]:
from collections import Counter
import numpy as np

We will create three `Counter` objects, one for words from positive reviews, one for words from negative reviews, and one for all the words.

In [20]:
# Create three Counter objects to store positive, negative and total counts
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()
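
As a hint for how a `Counter` accumulates counts, here is a small sketch on two made-up strings. The name `toy_counts` and the example text are illustrative only and are kept separate from the three counters above.

```python
from collections import Counter

# A separate toy counter so the real counters above stay untouched.
toy_counts = Counter()

# update() takes an iterable and adds 1 for each element it sees.
toy_counts.update("good good film".split(' '))
toy_counts.update("bad film".split(' '))

print(toy_counts)

# Looking up a word that was never seen returns 0 instead of raising KeyError.
print(toy_counts['missing'])
```

Because `update()` adds to existing counts rather than replacing them, calling it once per review builds up totals across the whole dataset.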

**TODO:** Examine all the reviews. For each word in a positive review, increase the count for that word in both your positive counter and the total words counter; likewise, for each word in a negative review, increase the count for that word in both your negative counter and the total words counter.

**Note:** Throughout these projects, you should use `split(' ')` to divide a piece of text (such as a review) into individual words. If you use `split()` instead, you will get slightly different results than what the videos and solutions show.
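
The difference between the two calls can be seen on a short made-up string that, like the preprocessed reviews, contains doubled and trailing spaces. The variable names here are illustrative.

```python
# Toy text with a double space and a trailing space (illustration only).
text = "this isn  t bad . "

# split(' ') cuts at every single space, so consecutive or trailing spaces
# produce empty-string tokens.
words_single_space = text.split(' ')

# split() with no argument collapses runs of whitespace and drops empty tokens.
words_whitespace = text.split()

print(words_single_space)  # ['this', 'isn', '', 't', 'bad', '.', '']
print(words_whitespace)    # ['this', 'isn', 't', 'bad', '.']
```

With `split(' ')`, the empty string `''` becomes a token of its own, so it will appear in any word counts built this way.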

In [23]:
#reviews[0].split(' ')

In [38]:
for i in range(len(reviews)):
    review_list = reviews[i].split(' ')
    total_counts.update(review_list)
    if labels[i] == "POSITIVE":
        positive_counts.update(review_list)
    else:
        negative_counts.update(review_list)

List the words used in positive and negative reviews, respectively, ordered from most to least commonly used.

In [42]:
positive_counts.most_common()[:20]

[('', 550583),
 ('the', 173367),
 ('.', 159704),
 ('and', 89731),
 ('a', 83715),
 ('of', 76866),
 ('to', 66774),
 ('is', 57256),
 ('in', 50221),
 ('br', 49243),
 ('it', 48036),
 ('i', 40752),
 ('that', 35636),
 ('this', 35082),
 ('s', 33824),
 ('as', 26318),
 ('with', 23251),
 ('for', 22422),
 ('was', 21917),
 ('film', 20938)]