# Dataset
Let's use the [Sentiment Polarity Dataset 2.0](https://www.cs.cornell.edu/people/pabo/movie-review-data/), which is included in the `NLTK` library.
It consists of 1000 positive and 1000 negative processed reviews. It was introduced in Pang/Lee ACL 2004 and released in June 2004.

In [1]:
import nltk
from nltk.corpus import stopwords
from collections import defaultdict
from nltk import word_tokenize
import string
from nltk.probability import FreqDist

In [2]:
from nltk.corpus import movie_reviews as mr
print("The corpus contains %d reviews"% len(mr.fileids()))

The corpus contains 2000 reviews


## Shuffling

Let's shuffle the documents; otherwise they will remain sorted by label: ["neg", "neg", ..., "pos"].

In [3]:
import random
docnames=mr.fileids()
random.shuffle(docnames)
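The shuffle above is not seeded, so the order (and therefore the later train/test split and accuracy) changes from run to run. A minimal sketch of a reproducible shuffle follows. The file ids, the helper name `seeded_shuffle`, and the seed value are illustrative assumptions, not part of the original notebook.

```python
import random

# Hypothetical file ids in the "category/filename" style used by the corpus
demo_docnames = ["neg/a.txt", "neg/b.txt", "neg/c.txt",
                 "pos/d.txt", "pos/e.txt", "pos/f.txt"]


def seeded_shuffle(names, seed=42):
    """Return a shuffled copy of names; the same seed gives the same order."""
    shuffled = list(names)                 # copy, so the input stays untouched
    random.Random(seed).shuffle(shuffled)  # private generator, fixed seed
    return shuffled


first = seeded_shuffle(demo_docnames)
second = seeded_shuffle(demo_docnames)

# The label is the part of the file id before the slash
demo_tags = [name.split("/")[0] for name in first]

print(first)
print(demo_tags)
print(first == second)
```

Using a private `random.Random` instance keeps the global random state untouched, which may matter if other cells also draw random numbers.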

## Let's do it using some useful scikit-learn functions 


In [4]:
import sklearn
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

### Assuming that documents are shuffled
Make sure `docnames` contains a shuffled list of documents.

In [5]:
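documents = []  # raw text of each review
tags = []       # label of each review, "neg" or "pos"
for doc in docnames:
    documents.append(mr.raw(doc))
    # the part of the file id before the slash is the label
    tags.append(doc.split('/')[0])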

In [13]:
for i in range(5):
    print("{} -> {}...".format(tags[i], documents[i][:50]))

neg -> how do you judge a film that is so bad , but inten...
pos -> do you want to know the truth about cats and dogs ...
pos -> david mamet has long been my favorite screenwriter...
neg -> well , what are you going to expect ? 
it's a movi...
neg -> some concepts seem patently hopeless from the begi...


In [14]:
numtrain = int(len(documents) * 80 / 100)  # number of training documents
train_documents, test_documents = documents[:numtrain], documents[numtrain:]
train_tags, test_tags = tags[:numtrain], tags[numtrain:]
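Because the split above simply cuts the shuffled lists at 80%, the two classes are only approximately balanced in each part. The sketch below repeats the same slicing on toy data and counts the labels on each side. The helper `split_train_test` and all `demo_` names are hypothetical, introduced only for illustration.

```python
import random
from collections import Counter


def split_train_test(items, labels, train_percent=80):
    """Cut two parallel lists at the same position (integer arithmetic)."""
    cut = len(items) * train_percent // 100
    return items[:cut], items[cut:], labels[:cut], labels[cut:]


# Toy data: 10 negative and 10 positive "documents"
demo_docs = ["review %d" % i for i in range(20)]
demo_tags = ["neg"] * 10 + ["pos"] * 10

# Shuffle documents and labels together so the pairs stay aligned
pairs = list(zip(demo_docs, demo_tags))
random.Random(0).shuffle(pairs)
demo_docs = [doc for doc, _ in pairs]
demo_tags = [tag for _, tag in pairs]

demo_train_docs, demo_test_docs, demo_train_tags, demo_test_tags = \
    split_train_test(demo_docs, demo_tags)

print(len(demo_train_docs), len(demo_test_docs))
print("train:", Counter(demo_train_tags))
print("test: ", Counter(demo_test_tags))
```

If the counts turn out too uneven, scikit-learn's `train_test_split` with its `stratify` argument is one alternative worth considering.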

Now that the training and test sets are separated, the text of the documents must be converted into its vector representation. scikit-learn offers two useful classes for this: `CountVectorizer` and `TfidfVectorizer`. Both accept a rich set of parameters that we do not explore here, but that are worth consulting.
- [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
- [sklearn.feature_extraction.text.TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
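To get a feel for the two vectorizers and a couple of their parameters, here is a small sketch on a toy corpus. The three sentences and the choice of `stop_words` and `ngram_range` are illustrative assumptions; they are not settings used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = [
    "the movie was great",
    "the movie was terrible",
    "great acting great plot",
]

# Plain word counts: one column per distinct token
count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_docs)
great_col = count_vec.vocabulary_["great"]
print(counts.shape)           # (documents, distinct tokens)
print(counts[2, great_col])   # how often "great" occurs in the third document

# Two of the available parameters: drop English stop words, add bigrams
filtered_vec = CountVectorizer(stop_words="english", ngram_range=(1, 2))
filtered_vec.fit(toy_docs)
print(sorted(filtered_vec.vocabulary_))

# TF-IDF weights instead of raw counts; rows are L2-normalised by default
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(toy_docs)
print(tfidf.toarray().round(2))
```

Both classes share the same `fit_transform` / `transform` interface, so one can be swapped for the other without changing the rest of the code.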

In [26]:
vectorizer = CountVectorizer()
train_X = vectorizer.fit_transform(train_documents)
test_X = vectorizer.transform(test_documents)
print(train_X.shape, test_X.shape)

(1600, 36205) (400, 36205)


We can verify that the features are in fact the "words" of the texts, which also include numbers and other odd tokens.

In [27]:
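# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out()[600:700].tolist())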

['abandonment', 'abandons', 'abating', 'abba', 'abberation', 'abberline', 'abbott', 'abbotts', 'abbreviated', 'abby', 'abc', 'abducted', 'abductees', 'abduction', 'abductions', 'abe', 'abel', 'aberdeen', 'aberration', 'abetted', 'abetting', 'abeyance', 'abhorrence', 'abhorrent', 'abider', 'abides', 'abiding', 'abigail', 'abiility', 'abilities', 'ability', 'abject', 'ablaze', 'able', 'ably', 'abnormal', 'abnormally', 'abo', 'aboard', 'abode', 'abolish', 'abolitionist', 'abolitionists', 'abominable', 'abomination', 'aborbed', 'aborginal', 'aboriginal', 'aboriginals', 'abort', 'aborted', 'abortion', 'abortionist', 'abortions', 'aboslutely', 'abound', 'abounded', 'abounding', 'abounds', 'about', 'abouts', 'above', 'abraded', 'abraham', 'abrahams', 'abrams', 'abrasive', 'abreast', 'abril', 'abroad', 'abrupt', 'abruptly', 'abs', 'absence', 'absences', 'absense', 'absent', 'absinthe', 'absoloute', 'absolute', 'absolutely', 'absolutes', 'absolution', 'absolutist', 'absorb', 'absorbant', 'absor
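Numbers and misspellings end up as features because of how the text is tokenized. As far as I know, the default `token_pattern` of `CountVectorizer` is `(?u)\b\w\w+\b`, meaning any run of two or more word characters, digits included. The sketch below applies that pattern with the standard `re` module and compares it with a stricter, letters-only pattern. The sample sentence and the stricter pattern are assumptions made for illustration.

```python
import re

# Believed to be CountVectorizer's default token pattern (an assumption here)
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Hypothetical stricter alternative: at least two ASCII letters, no digits
LETTERS_ONLY_PATTERN = r"\b[a-z]{2,}\b"

sample = "It's a 10/10 movie from 2004, absoloutely great!"
lowered = sample.lower()  # CountVectorizer lowercases by default

default_tokens = re.findall(DEFAULT_PATTERN, lowered)
letters_only_tokens = re.findall(LETTERS_ONLY_PATTERN, lowered)

print(default_tokens)
print(letters_only_tokens)
```

Misspellings such as "absoloutely" survive both patterns; filtering them would need a different approach, for example a minimum document frequency.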

In [28]:
classifier = MultinomialNB()

In [29]:
classifier.fit(train_X, train_tags)

MultinomialNB()

In [30]:
pred = classifier.predict(test_X)

In [31]:
score = sklearn.metrics.accuracy_score(test_tags, pred)
print("accuracy:   %0.3f" % score)

accuracy:   0.845
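Accuracy alone hides how the errors are distributed between the two classes. The sketch below uses `classification_report`, which is imported earlier but not used, together with `confusion_matrix`. The label lists are made-up toy values, not the predictions obtained above.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Toy labels for illustration only
demo_true = ["pos", "neg", "pos", "neg", "pos", "neg"]
demo_pred = ["pos", "neg", "neg", "neg", "pos", "pos"]

demo_accuracy = accuracy_score(demo_true, demo_pred)
print("accuracy:   %0.3f" % demo_accuracy)

# Rows are true labels, columns are predicted labels, in the order given
demo_cm = confusion_matrix(demo_true, demo_pred, labels=["neg", "pos"])
print(demo_cm)

# Precision, recall and F1 per class
print(classification_report(demo_true, demo_pred))
```

With the notebook's own variables, the equivalent call would be `classification_report(test_tags, pred)`.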


# How to use the classifier on new sentences?

1. Apply exactly the same preprocessing that was previously applied to the test set.
2. Apply the classifier to the result of that preprocessing.

In [37]:
frases = ["I love movies very much", 
          "I hate my stupid life",
          "I am disapointed with the argument"]
frases_X = vectorizer.transform(frases)
classifier.predict(frases_X)

array(['pos', 'neg', 'pos'], dtype='<U3')
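One way to make sure new sentences always receive the same processing as the training data is to bundle the vectorizer and the classifier in a scikit-learn pipeline. This is a sketch of an alternative arrangement, not what the notebook does above; the tiny training set and all `demo_` names are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set, for illustration only
demo_train = [
    "great wonderful film",
    "wonderful acting great story",
    "terrible boring film",
    "boring plot terrible acting",
]
demo_labels = ["pos", "pos", "neg", "neg"]

# The pipeline applies the vectorizer, then the classifier, in one call
demo_model = make_pipeline(CountVectorizer(), MultinomialNB())
demo_model.fit(demo_train, demo_labels)

demo_sentences = ["a wonderful great movie", "boring and terrible"]
demo_pred = demo_model.predict(demo_sentences)
demo_proba = demo_model.predict_proba(demo_sentences)

print(list(demo_pred))
print(list(demo_model.classes_))  # column order of predict_proba
print(demo_proba.round(3))
```

Words that were never seen during training, such as "movie" in this toy example or a misspelled word in a new sentence, are silently ignored by the vectorizer, so predictions for such sentences rest only on the remaining known words.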