# Dataset
Let's use the [Sentiment Polarity Dataset 2.0](https://www.cs.cornell.edu/people/pabo/movie-review-data/), included in the `NLTK` library.
It consists of 1000 positive and 1000 negative processed movie reviews. It was introduced in Pang/Lee ACL 2004 and released in June 2004.

In [1]:
import nltk
from nltk.corpus import stopwords
from collections import defaultdict
from nltk import word_tokenize
import string
from nltk.probability import FreqDist

In [2]:
from nltk.corpus import movie_reviews as mr
print("The corpus contains %d reviews"% len(mr.fileids()))

The corpus contains 2000 reviews


## Shuffling

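Let's shuffle the documents; otherwise they remain sorted by label: ["neg", "neg", ..., "pos"]. With sorted documents, a split by position would put almost all of one class in the training set and leave the test set with a single class.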

In [3]:
import random
docnames=mr.fileids()
random.shuffle(docnames)
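
Since `random.shuffle` draws from an unseeded generator, every run produces a different order and therefore a slightly different train/test split and accuracy. A minimal sketch of a reproducible shuffle follows. The seed value 42 and the helper name `shuffled_copy` are arbitrary choices, and the file ids are made-up stand-ins for `mr.fileids()`.

```python
import random

# Made-up stand-ins for mr.fileids(); real ids come from the corpus.
docnames = ["neg/a.txt", "neg/b.txt", "neg/c.txt",
            "pos/d.txt", "pos/e.txt", "pos/f.txt"]

def shuffled_copy(names, seed=42):
    """Return a shuffled copy of names; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, global random state untouched
    result = list(names)       # copy, so the original order is preserved
    rng.shuffle(result)
    return result

first = shuffled_copy(docnames)
second = shuffled_copy(docnames)
print(first == second)  # True: same seed, same order
```

Calling `random.seed(42)` before `random.shuffle(docnames)` would have a similar effect in the notebook itself, at the cost of fixing the global generator state.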

## Let's do it using some useful scikit-learn functions 


In [4]:
import sklearn
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

### Assuming that documents are shuffled
Go back to the shuffle section and make sure `docnames` contains a shuffled list of documents.

In [5]:
documents = []
tags = []
for doc in docnames:
    documents.append(mr.raw(doc))      # full text of the review
    tags.append(doc.split('/')[0])     # file ids have the form "<label>/<filename>"

In [6]:
for i in range(2):
    print(tags[i], documents[i])

neg the second serial-killer thriller of the month is just awful . 
oh , it starts deceptively okay , with a handful of intriguing characters and some solid location work . 
after a baby-sitter gets gutted in the suit- ably spooky someone's-in-the-house prologue , parallel stories unfold , the first involving a texas sheriff ( r . lee emery ) , a gruesome double murder , and the arrival of a morose fbi agent ( dennis quaid ) on the eve of voting for the local lawman's reelection . 
the second pairs a hitch- hiker ( jared leto ) with a friendly former railroad worker ( danny glover ) . 
they're headed west , toward the rockies and away from the murder scene . 
which one is the killer ? 
well , it doesn't really matter , 'cause when writer/first-time director jeb stuart ( die hard ) finally spills the beans , you won't take his choice seriously anyway . 
the whole thing goes south about an hour in , with the tale taking hairpin turns that i certainly couldn't follow . 
and through the wh

In [7]:
numtrain = int(len(documents) * 80 / 100)  # number of training documents
train_documents, test_documents = documents[:numtrain], documents[numtrain:]
train_tags, test_tags = tags[:numtrain], tags[numtrain:]
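
A positional split like the one above only gives balanced classes if the documents were shuffled first, so it can be worth checking the label counts on each side. The sketch below illustrates that check on toy data. The helper `split_train_test` and the sample tags are hypothetical and stand in for the notebook's `documents` and `tags`.

```python
from collections import Counter

def split_train_test(items, labels, train_percent=80):
    """Split two parallel lists at the same position."""
    cut = len(items) * train_percent // 100  # integer arithmetic, no float rounding
    return items[:cut], items[cut:], labels[:cut], labels[cut:]

# Toy stand-ins for `documents` and `tags`, assumed to be shuffled already.
toy_documents = ["d%d" % i for i in range(10)]
toy_tags = ["neg", "pos", "pos", "neg", "pos", "neg", "neg", "pos", "neg", "pos"]

tr_docs, te_docs, tr_tags, te_tags = split_train_test(toy_documents, toy_tags)
print(len(tr_docs), len(te_docs))  # 8 2
print(Counter(tr_tags))            # label counts in the training part
print(Counter(te_tags))            # label counts in the test part
```

If one label dominates either part, the documents were probably not shuffled.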

In [8]:
vectorizer = CountVectorizer()
train_X = vectorizer.fit_transform(train_documents)
test_X = vectorizer.transform(test_documents)
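
To make the difference between `fit_transform` and `transform` concrete, here is a rough pure-Python sketch of a bag-of-words vectorizer. It is not scikit-learn's implementation: it only mimics the default behaviour of lowercasing and keeping tokens of two or more word characters, and it returns plain lists rather than a sparse matrix. The function names are illustrative.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of 2+ word characters

def tokenize(text):
    return TOKEN.findall(text.lower())

def fit_vocabulary(texts):
    """Learn a word -> column index mapping from the training texts only."""
    words = sorted({w for t in texts for w in tokenize(t)})
    return {w: i for i, w in enumerate(words)}

def transform(texts, vocabulary):
    """Count known words per text; words outside the vocabulary are ignored."""
    rows = []
    for t in texts:
        counts = Counter(w for w in tokenize(t) if w in vocabulary)
        row = [0] * len(vocabulary)
        for w, c in counts.items():
            row[vocabulary[w]] = c
        rows.append(row)
    return rows

train_texts = ["a good good movie", "a bad movie"]
test_texts = ["good plot"]
vocab = fit_vocabulary(train_texts)
print(vocab)                          # {'bad': 0, 'good': 1, 'movie': 2}
print(transform(train_texts, vocab))  # [[0, 2, 1], [1, 0, 1]]
print(transform(test_texts, vocab))   # [[0, 1, 0]]
```

The test text is mapped onto the columns learned from the training texts, so "plot" is simply dropped. This is why the test set must go through `transform` and not `fit_transform`.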

In [23]:
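# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out()[1000:1100].tolist())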

['adrian', 'adrianne', 'adrien', 'adrienne', 'adrift', 'adroit', 'adroitly', 'ads', 'adulation', 'adult', 'adulterer', 'adulterous', 'adultery', 'adulthood', 'adultrous', 'adults', 'advance', 'advanced', 'advancement', 'advances', 'advancing', 'advantage', 'advantaged', 'advantages', 'advent', 'adventure', 'adventurer', 'adventures', 'adventurous', 'adversarial', 'adversaries', 'adversary', 'adverse', 'adversely', 'adversity', 'advertise', 'advertised', 'advertisement', 'advertisements', 'advertiser', 'advertising', 'advertisment', 'advice', 'advil', 'advisable', 'advise', 'advised', 'adviser', 'advisers', 'advises', 'advising', 'advisor', 'advisors', 'advocate', 'advocated', 'advocates', 'advocating', 'aerial', 'aerosmith', 'aerospace', 'aesthetic', 'aesthetically', 'aesthetics', 'afa', 'afar', 'afeminite', 'affability', 'affable', 'affair', 'affairs', 'affay', 'affect', 'affectations', 'affected', 'affecting', 'affection', 'affectionate', 'affectionately', 'affections', 'affects', 'a

In [24]:
print(train_X.shape, test_X.shape)

(1600, 36284) (400, 36284)


In [25]:
classifier = MultinomialNB()

In [27]:
classifier.fit(train_X, train_tags)

MultinomialNB()
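
For intuition about what `fit` estimates, the following is a simplified sketch of multinomial Naive Bayes over count vectors, with add-one (Laplace) smoothing. It is an illustration under the assumption of dense list inputs, not scikit-learn's code, and `train_nb` and `predict_nb` are hypothetical names.

```python
import math
from collections import defaultdict

def train_nb(rows, labels, alpha=1.0):
    """Estimate log priors and smoothed per-word log likelihoods per class."""
    n_features = len(rows[0])
    class_docs = defaultdict(int)
    class_counts = defaultdict(lambda: [0] * n_features)
    for row, label in zip(rows, labels):
        class_docs[label] += 1
        for j, c in enumerate(row):
            class_counts[label][j] += c
    model = {}
    for label, counts in class_counts.items():
        total = sum(counts) + alpha * n_features  # smoothed denominator
        log_prior = math.log(class_docs[label] / len(rows))
        log_like = [math.log((c + alpha) / total) for c in counts]
        model[label] = (log_prior, log_like)
    return model

def predict_nb(model, row):
    """Return the class with the highest log posterior score for one row."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(c * ll for c, ll in zip(row, log_like))
    return max(model, key=score)

# Toy count rows with columns: bad, good, movie
toy_rows = [[0, 2, 1], [1, 0, 1]]
toy_labels = ["pos", "neg"]
model = train_nb(toy_rows, toy_labels)
print(predict_nb(model, [0, 1, 0]))  # pos
print(predict_nb(model, [1, 0, 0]))  # neg
```

Smoothing keeps a word that never appeared in a class from driving that class's probability to zero.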

In [28]:
pred = classifier.predict(test_X)

In [29]:
score = sklearn.metrics.accuracy_score(test_tags, pred)
print("accuracy:   %0.3f" % score)

accuracy:   0.807
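
Accuracy alone hides how each class behaves. The notebook imports `classification_report`, which reports per-class precision, recall and F1. As a rough illustration of what precision and recall measure, here is a pure-Python sketch on made-up labels. The helper `per_class_metrics` is hypothetical, and the numbers are unrelated to the result above.

```python
def per_class_metrics(true_tags, predicted_tags):
    """Return {label: (precision, recall)} for every label seen."""
    results = {}
    for label in sorted(set(true_tags) | set(predicted_tags)):
        tp = sum(1 for t, p in zip(true_tags, predicted_tags)
                 if t == label and p == label)
        predicted = sum(1 for p in predicted_tags if p == label)
        actual = sum(1 for t in true_tags if t == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        results[label] = (precision, recall)
    return results

# Made-up labels for illustration only.
toy_true = ["pos", "pos", "neg", "neg", "neg"]
toy_pred = ["pos", "neg", "neg", "neg", "pos"]
metrics = per_class_metrics(toy_true, toy_pred)
for label, (precision, recall) in metrics.items():
    print("%s precision=%.2f recall=%.2f" % (label, precision, recall))
# neg precision=0.67 recall=0.67
# pos precision=0.50 recall=0.50
```

In the notebook itself, `print(classification_report(test_tags, pred))` should give the corresponding figures for the real test set.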


# How do we use the classifier on new sentences?

1. Apply exactly the same processing that was previously applied to the test set.
2. Apply the classifier to the result of that processing.

In [30]:
frases = ["I love movies very much",
          "I hate my stupid life"]
# transform only: reuse the vocabulary learned from the training set
frases_X = vectorizer.transform(frases)
classifier.predict(frases_X)

array(['pos', 'neg'], dtype='<U3')