# Dataset
Let's use the [Sentiment Polarity Dataset 2.0](https://www.cs.cornell.edu/people/pabo/movie-review-data/), which is included in the `NLTK` library.
It consists of 1000 positive and 1000 negative processed reviews. It was introduced in Pang and Lee (ACL 2004) and released in June 2004.

In [None]:
import nltk
# Usually only needed when running on Google Colab
nltk.download(['movie_reviews','punkt','stopwords','averaged_perceptron_tagger'])

from nltk.corpus import stopwords
from collections import defaultdict
from nltk import word_tokenize
import string
from nltk.probability import FreqDist

In [None]:
from nltk.corpus import movie_reviews as mr
print("The corpus contains %d reviews"% len(mr.fileids()))

## Shuffling

Let's shuffle the documents; otherwise they remain sorted by label: ["neg", "neg", ..., "pos"].

In [None]:
import random
docnames = list(mr.fileids())  # copy, so the in-place shuffle works on our own list
random.shuffle(docnames)

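Because `random.shuffle` gives a different order on every run, the train/test split below changes between runs as well. A minimal sketch of a reproducible shuffle, assuming a fixed seed is acceptable; the file names and the helper `shuffled_copy` are hypothetical stand-ins, not part of the corpus or of NLTK:

```python
import random

# Hypothetical stand-in for mr.fileids(): sorted, all "neg" before "pos".
docnames = ["neg/cv000.txt", "neg/cv001.txt", "neg/cv002.txt",
            "pos/cv000.txt", "pos/cv001.txt", "pos/cv002.txt"]

def shuffled_copy(names, seed=42):
    """Return a shuffled copy of names; the same seed gives the same order."""
    copy = list(names)
    random.Random(seed).shuffle(copy)  # private generator, global state untouched
    return copy

first = shuffled_copy(docnames)
second = shuffled_copy(docnames)
print(first == second)             # same seed, same order
print(sorted(first) == docnames)   # same items, only the order changed
```

Fixing the seed makes accuracy figures comparable across runs; dropping it restores the original behaviour.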
## Let's do it using some useful scikit-learn functions 


In [None]:
import sklearn
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

### Assuming that documents are shuffled
Make sure `docnames` contains a shuffled list of documents.

In [None]:
documents = []
tags = []
for doc in docnames:
    documents.append(mr.raw(doc))
    tags.append(doc.split('/')[0])  # the label is the directory name: "neg" or "pos"

In [None]:
for i in range(5):
    print("{} -> {}...".format(tags[i], documents[i][:50]))

In [None]:
numtrain = int(len(documents) * 80 / 100)  # number of training documents
train_documents, test_documents = documents[:numtrain], documents[numtrain:]
train_tags, test_tags = tags[:numtrain], tags[numtrain:]

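Since the split simply slices a randomly shuffled list, the two classes are not guaranteed to be equally represented in each part. A small sketch of how the balance could be checked, using hypothetical labels in place of the real `tags`:

```python
from collections import Counter

# Hypothetical shuffled labels standing in for `tags`.
tags = ["pos", "neg", "neg", "pos", "pos", "neg", "pos", "neg", "neg", "pos"]

numtrain = int(len(tags) * 80 / 100)  # same 80/20 rule as above
train_tags, test_tags = tags[:numtrain], tags[numtrain:]

# Count how many documents of each class ended up in each part.
train_counts = Counter(train_tags)
test_counts = Counter(test_tags)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```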
Now that the training and test sets are separated, the text of the documents has to be converted into a vector representation. scikit-learn offers two useful classes for this: `CountVectorizer` and `TfidfVectorizer`. Both accept a rich set of parameters that we do not explore here but that are worth consulting.
- [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
- [sklearn.feature_extraction.text.TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

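To make the idea of a count-based vector representation concrete, here is a hand-rolled sketch in plain Python. It is an illustration only, not scikit-learn's implementation: the helpers `tokenize`, `fit_vocabulary` and `transform` are hypothetical, and the tokenization merely approximates `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters). It also shows why the vocabulary is learned from the training documents only and then reused for the test documents.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # similar to CountVectorizer's default settings.
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(documents):
    """Map each distinct training token to a column index (sorted order)."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(documents, vocabulary):
    """Turn each document into a list of counts, one per vocabulary entry."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train_docs = ["a great great movie", "a boring movie"]
test_docs = ["great plot, boring ending"]

vocab = fit_vocabulary(train_docs)         # learned from training data only
train_rows = transform(train_docs, vocab)
test_rows = transform(test_docs, vocab)    # "plot" and "ending" are ignored
print(vocab)       # {'boring': 0, 'great': 1, 'movie': 2}
print(train_rows)  # [[0, 2, 1], [1, 0, 1]]
print(test_rows)   # [[1, 1, 0]]
```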
In [None]:
vectorizer = CountVectorizer()
train_X = vectorizer.fit_transform(train_documents)
test_X = vectorizer.transform(test_documents)
print(train_X.shape, test_X.shape)

We can verify that the features are in fact the "words" of the texts, which also include numbers and other odd tokens.

In [None]:
# get_feature_names_out() replaces get_feature_names() in recent scikit-learn versions
print(vectorizer.get_feature_names_out()[600:700])

In [None]:
classifier = MultinomialNB()

In [None]:
classifier.fit(train_X, train_tags)

In [None]:
pred = classifier.predict(test_X)

In [None]:
score = sklearn.metrics.accuracy_score(test_tags, pred)
print("accuracy:   %0.3f" % score)

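Accuracy is a single number and hides how each class behaves. The sketch below computes precision and recall per class by hand, on hypothetical labels rather than on the notebook's results; the helper `per_class_metrics` is an assumption made for illustration. scikit-learn's `classification_report`, imported earlier but not called, reports the same quantities when given `test_tags` and `pred`.

```python
def per_class_metrics(true_tags, predicted_tags, label):
    """Return (precision, recall) for one label, computed from raw counts."""
    pairs = list(zip(true_tags, predicted_tags))
    true_pos = sum(1 for t, p in pairs if t == label and p == label)
    predicted = sum(1 for _, p in pairs if p == label)  # times the label was predicted
    actual = sum(1 for t, _ in pairs if t == label)     # times the label was correct
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

# Hypothetical labels, not results from the notebook.
true_tags = ["pos", "pos", "pos", "neg", "neg", "neg"]
pred_tags = ["pos", "pos", "neg", "neg", "neg", "pos"]

accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)
print("accuracy: %0.3f" % accuracy)
for label in ("neg", "pos"):
    precision, recall = per_class_metrics(true_tags, pred_tags, label)
    print("%s precision=%0.3f recall=%0.3f" % (label, precision, recall))
```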
# How can the classifier be used on new sentences?

1. Apply exactly the same processing that was previously applied to the test set.
2. Apply the classifier to the result of that processing.

In [None]:
frases = ["I love movies very much", 
          "I hate my stupid life",
          "I am disapointed with the argument"]
frases_X = vectorizer.transform(frases)
classifier.predict(frases_X)
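
The same two steps can be wrapped in a small reusable helper. This is a minimal sketch: `classify_sentences` is a hypothetical name, and the two keyword-based classes are stand-ins that exist only so the example runs on its own. In the notebook, the fitted `vectorizer` and `classifier` would be passed instead.

```python
def classify_sentences(sentences, vectorizer, classifier):
    """Vectorize sentences with the already fitted vectorizer, then predict."""
    features = vectorizer.transform(sentences)  # transform only, never fit again
    return list(zip(sentences, classifier.predict(features)))


# Hypothetical stand-ins so the sketch runs on its own; in the notebook,
# pass the fitted `vectorizer` and `classifier` objects instead.
class KeywordVectorizer:
    def transform(self, sentences):
        return [[s.lower().count("love"), s.lower().count("hate")]
                for s in sentences]

class KeywordClassifier:
    def predict(self, rows):
        return ["pos" if love >= hate else "neg" for love, hate in rows]


results = classify_sentences(
    ["I love movies very much", "I hate my stupid life"],
    KeywordVectorizer(), KeywordClassifier())
for sentence, label in results:
    print("{} -> {}".format(label, sentence))
```

Calling `fit` or `fit_transform` on the new sentences would build a different vocabulary, and the resulting columns would no longer match the ones the classifier was trained on.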