# Creating our own classifier
Let's use the [Sentiment Polarity Dataset 2.0](https://www.cs.cornell.edu/people/pabo/movie-review-data/), available as the `movie_reviews` corpus in the `NLTK` library.<br>
It consists of 1000 positive and 1000 negative processed reviews. It was introduced in Pang and Lee (ACL 2004) and released in June 2004.

In [None]:
import nltk
from nltk.corpus import stopwords
from collections import defaultdict
from nltk import word_tokenize
import string

In [None]:
from nltk.corpus import movie_reviews as mr
print("The corpus contains %d reviews"% len(mr.fileids()))

In [None]:
for i in mr.fileids()[995:1005]:  # file ids 995 to 1004, straddling the neg/pos boundary
    print(i, "==>", i.split('/')[0])

Let's see the content of one of these reviews

In [None]:
print(mr.raw(mr.fileids()[995]))

### Checking which are the most frequent words

Calculating the frequency of each word in the document ...

In [None]:
from nltk.probability import FreqDist
FreqDist(mr.raw(mr.fileids()[1]).split())

Let's take a look at the most frequent words in the whole corpus.

The previous code has flaws because split() is a very basic way of finding the words. Let's use `word_tokenize()` or `mr.words()` instead...

In [None]:
wordfreq = FreqDist()
for i in mr.fileids():
    wordfreq += FreqDist(w.lower() for w in word_tokenize(mr.raw(i)))
print(wordfreq)
print(wordfreq.most_common(30))
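
To see why `split()` is too crude, here is a minimal sketch on a made-up sentence. It uses a simple regular expression as a stand-in for a real tokenizer (an assumption for illustration; `word_tokenize` applies more elaborate rules) and `collections.Counter`, which `FreqDist` subclasses.

```python
import re
from collections import Counter

# Made-up sentence, for illustration only
text = "Good movie, good cast. Not good? No: GOOD!"

# Whitespace splitting leaves punctuation glued to the words
split_counts = Counter(text.lower().split())

# Stand-in tokenizer: runs of word characters, or single punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", text.lower())
token_counts = Counter(tokens)

print(split_counts["good"])   # counts only the bare form "good"
print(token_counts["good"])   # counts every occurrence of the word
print(sorted(split_counts))   # note entries such as "good?" and "good!"
```

With whitespace splitting, `good`, `good?` and `good!` end up as three different entries, so the frequency of the word itself is underestimated.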

Stop words and punctuation dominate the list, so let's remove them...

In [None]:
stopw = set(stopwords.words('english'))  # a set makes the membership tests below fast
wordfreq = FreqDist()
for i in mr.fileids():
    wordfreq += FreqDist(w.lower() for w in word_tokenize(mr.raw(i)) 
        if w.lower() not in stopw and w.lower() not in string.punctuation)
print(wordfreq)
print(wordfreq.most_common(30))
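
One subtlety: `w.lower() not in string.punctuation` is a substring test against a single string, so it removes single marks but lets multi-character tokens such as `--` or `...` through. A possible stricter check is sketched below (the helper `is_punctuation` is hypothetical, not part of NLTK).

```python
import string

def is_punctuation(token):
    """True when the token is non-empty and made only of punctuation marks."""
    return len(token) > 0 and all(ch in string.punctuation for ch in token)

for tok in [",", "--", "...", "``", "film"]:
    # Second column: the substring test used above; third: the per-character test
    print(repr(tok), tok in string.punctuation, is_punctuation(tok))
```

Whether the stricter filter is worth it depends on how often such tokens show up in the most frequent words.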

## Shuffling

Let's shuffle the documents; otherwise they will remain sorted by label: ["neg", "neg" ... "pos"]

In [None]:
import random

Let's read the raw text of each document, together with its label ...

In [None]:
docnames=list(mr.fileids())  # work on a copy, so the in-place shuffle does not reorder the corpus reader's own list
random.shuffle(docnames)
documents=[]
for i in docnames:
    y = i.split('/')[0]
    documents.append( (mr.raw(i), y) )
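
`random.shuffle` reorders the list in place and, without a seed, gives a different order on every run, so the accuracy figures will vary between runs. If reproducible splits are wanted, one option is a seeded generator, sketched here with made-up file names (the seed value 42 and the helper `shuffled_copy` are arbitrary choices for illustration).

```python
import random

# Hypothetical file ids, for illustration only
names = ["neg/a.txt", "neg/b.txt", "pos/c.txt", "pos/d.txt"]

def shuffled_copy(items, seed):
    """Return a shuffled copy; the input list is left untouched."""
    copy = list(items)
    random.Random(seed).shuffle(copy)  # private generator with a fixed seed
    return copy

first = shuffled_copy(names, seed=42)
second = shuffled_copy(names, seed=42)
print(first == second)                  # same seed, same order
print(sorted(first) == sorted(names))   # same items, only reordered
```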

Let's take a look at our documents...

In [None]:
for docs in documents[0:2]:
    print(docs)

## Document representation

Now, let's produce the final document representation, in the form of a frequency distribution ...

First, without stop words and punctuation ... (you could use another technique, such as IDF weighting)

In [None]:
stopw = set(stopwords.words('english'))  # a set makes the membership tests below fast
docrep=[]
for text,tag in documents:
    features = FreqDist(w for w in word_tokenize(text) 
                        if w.lower() not in stopw and w.lower() not in string.punctuation)
    docrep.append( (features, tag) )
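
As a rough sketch of the IDF alternative mentioned above: inverse document frequency gives a low weight to words that occur in most documents, so it can take the place of a fixed stop-word list. The example uses made-up toy documents and the plain `log(N / df)` formula, which is only one of several common variants.

```python
import math
from collections import Counter

# Three tiny made-up documents, already tokenized
docs = [["good", "film"], ["bad", "film"], ["good", "plot", "film"]]

# Document frequency: in how many documents does each word occur?
df = Counter(word for doc in docs for word in set(doc))

# One common IDF variant: log(N / df); other variants add smoothing
n_docs = len(docs)
idf = {word: math.log(n_docs / count) for word, count in df.items()}

# Words listed from least to most informative
for word in sorted(idf, key=idf.get):
    print(word, round(idf[word], 3))
```

A word such as `film`, present in every document, gets a weight of zero, which is the same effect a stop-word list aims for.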

Let's take a look at our documents again...

In [None]:
for doc in docrep[:5]:
    print(doc)

## NLTK classifier: Naive Bayes

Defining our training and test sets...

In [None]:
numtrain = int(len(documents) * 80 / 100)  # number of training documents

In [None]:
train_set, test_set = docrep[:numtrain], docrep[numtrain:]
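
Since the split is a plain slice of the shuffled list, the two classes are only approximately balanced in each part. A quick way to check is to count the tags on each side, sketched here on made-up data (the `toy_` names are hypothetical; with the real data you would use `train_set` and `test_set`).

```python
from collections import Counter

# Made-up (features, tag) pairs standing in for docrep
toy_docrep = [({}, "pos"), ({}, "neg"), ({}, "neg"), ({}, "pos"), ({}, "neg")]
toy_numtrain = int(len(toy_docrep) * 80 / 100)  # 80% of the documents

toy_train, toy_test = toy_docrep[:toy_numtrain], toy_docrep[toy_numtrain:]
train_tags = Counter(tag for _, tag in toy_train)
test_tags = Counter(tag for _, tag in toy_test)
print("train:", train_tags)
print("test: ", test_tags)
```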

In [None]:
print(test_set[0])

In [None]:
from nltk.classify import NaiveBayesClassifier as nbc

In [None]:
classifier = nbc.train(train_set)
print("Accuracy: {:.3f}".format( nltk.classify.accuracy(classifier, test_set)))
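
`NaiveBayesClassifier.train` expects a list of `(feature_dict, label)` pairs, which is the shape of `docrep`. It treats every feature value as a category, so a count of 2 and a count of 3 for the same word are simply two different values. A minimal sketch on made-up data (the feature name `contains_great` and the toy examples are hypothetical):

```python
from nltk.classify import NaiveBayesClassifier

# Made-up training data: one boolean feature per document
toy_train = [
    ({"contains_great": True}, "pos"),
    ({"contains_great": True}, "pos"),
    ({"contains_great": False}, "neg"),
    ({"contains_great": False}, "neg"),
]
toy_classifier = NaiveBayesClassifier.train(toy_train)

# Classify two unseen feature dictionaries
pred_true = toy_classifier.classify({"contains_great": True})
pred_false = toy_classifier.classify({"contains_great": False})
print(pred_true, pred_false)
print(sorted(toy_classifier.labels()))
```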

Another (more generic) way of evaluating the accuracy...

In [None]:
from nltk.metrics import scores
test_ref = [tag for doc,tag in test_set]
test_pred = classifier.classify_many([doc for doc,tag in test_set])
print("Accuracy: {:.3f}".format(scores.accuracy(test_ref, test_pred) ) )  # signature is accuracy(reference, test)
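
Accuracy computed this way is just the share of positions where the reference and predicted labels agree. A minimal sketch with made-up labels, which also shows precision and recall for the `pos` class as a complementary view:

```python
# Made-up reference and predicted labels, for illustration only
reference = ["pos", "neg", "neg", "pos", "neg"]
predicted = ["pos", "neg", "pos", "pos", "neg"]

# Accuracy: share of positions where both lists agree
matches = sum(r == p for r, p in zip(reference, predicted))
accuracy = matches / len(reference)
print("Accuracy: {:.3f}".format(accuracy))

# Per-class view for "pos": precision and recall
tp = sum(r == p == "pos" for r, p in zip(reference, predicted))
precision = tp / predicted.count("pos")
recall = tp / reference.count("pos")
print("Precision(pos): {:.3f}  Recall(pos): {:.3f}".format(precision, recall))
```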

In [None]:
classifier.show_most_informative_features(5)

To simplify the next steps, let's create a function that trains the classifier and evaluates it right away...

In [None]:
from nltk.classify import NaiveBayesClassifier as nbc
from nltk.metrics import scores

def train_and_evaluate(train_set, test_set):
    classifier = nbc.train(train_set)
    test_ref = [tag for doc,tag in test_set]
    test_pred = classifier.classify_many([doc for doc,tag in test_set])
    print("Accuracy: {:.3f}".format(scores.accuracy(test_ref, test_pred) ) )  # signature is accuracy(reference, test)
    return classifier

In [None]:
train_and_evaluate(train_set, test_set)

### Now, let's select only the most relevant words ...

In [None]:
feature_counts=FreqDist()
for doc_feature_counts, _ in train_set:
    feature_counts += doc_feature_counts
print(feature_counts.most_common(30))

In [None]:
selected_features=[f for f,ntimes in feature_counts.most_common(1000)]

First version: keep only the selected words that occur in each document, using the word *frequency* as the feature value... (after executing, test the performance)

In [None]:
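def docs2features(docs, selected_features):
    selected = set(selected_features)  # set lookup instead of scanning a list
    features = []
    for doc_feature_counts, tag in docs:
        # keep only the selected words that actually occur in this document
        features.append(({w:f for w,f in doc_feature_counts.items() if w in selected}, tag))
    return features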

train_set = docs2features(docrep[:numtrain], selected_features)
test_set = docs2features(docrep[numtrain:], selected_features)

In [None]:
print(test_set[0])

In [None]:
train_and_evaluate(train_set, test_set)

Second version: for each one of the *selected_features*, use its frequency in each document, with an explicit 0 when the word is absent... (after executing, go back and test the performance)

In [None]:
def docs2features(docs, selected_features):
    features = []
    for doc_feature_counts, tag in docs:
        features.append(({f:doc_feature_counts[f] for f in selected_features}, tag))
    return features

train_set = docs2features(docrep[:numtrain], selected_features)
test_set = docs2features(docrep[numtrain:], selected_features)
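
The two versions of `docs2features` differ in how they treat selected words that a document does not contain. A small sketch on made-up counts, with `collections.Counter` standing in for `FreqDist` (which subclasses it):

```python
from collections import Counter

# Made-up word counts for one document, and a tiny feature list
doc_counts = Counter({"good": 2, "plot": 1, "zebra": 1})
selected = ["good", "bad", "plot"]

# Variant 1: keep only selected words that occur in the document
sparse = {w: f for w, f in doc_counts.items() if w in selected}

# Variant 2: one entry per selected word, zero when it is absent
dense = {f: doc_counts[f] for f in selected}

print(sparse)
print(dense)
```

In the second variant an absent word shows up as an explicit value of 0, which the classifier can use as evidence; in the first it is simply a missing feature.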

In [None]:
print(test_set[0])

In [None]:
train_and_evaluate(train_set, test_set)

## Now with part-of-speech tags

In [None]:
nltk.pos_tag(nltk.word_tokenize("time flies like an arrow"))

In [None]:
nltk.pos_tag(["he", "flies"])

In [None]:
print(documents[0])

In [None]:
nltk.pos_tag(word_tokenize(documents[0][0]))[:10]

In [None]:
stopw = set(stopwords.words('english'))  # a set makes the membership tests below fast
docrep=[]
for text,tag in documents:
    features = FreqDist("%s_%s"%(w,p) for w,p in nltk.pos_tag(word_tokenize(text)) 
                        if w.lower() not in stopw and w.lower() not in string.punctuation)
    docrep.append( (features, tag) )
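
Joining each word with its tag splits ambiguous words into separate features. A minimal sketch, using hand-written Penn Treebank style tags as a stand-in for the tagger's output (the tags shown are assumed for illustration, not actual `nltk.pos_tag` results):

```python
from collections import Counter

# Hand-written (word, tag) pairs standing in for nltk.pos_tag output
tagged = [("he", "PRP"), ("flies", "VBZ"), ("past", "IN"),
          ("the", "DT"), ("flies", "NNS")]

# Word-only features versus word_TAG features
plain = Counter(w for w, _ in tagged)
with_tags = Counter("%s_%s" % (w, p) for w, p in tagged)

print(plain["flies"])                                   # one feature covers both uses
print(with_tags["flies_VBZ"], with_tags["flies_NNS"])   # two separate features
```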

In [None]:
docrep[0]

In [None]:
feature_counts=FreqDist()
for doc_feature_counts, _ in docrep[:numtrain]:  # training portion only, so the test set does not influence feature selection
    feature_counts += doc_feature_counts
print(feature_counts.most_common(10))
selected_features=[f for f,freq in feature_counts.most_common(1000)]

In [None]:
def docs2features(docs, selected_features):
    features = []
    for doc_feature_counts, tag in docs:
        features.append(({f:doc_feature_counts[f] for f in selected_features}, tag))
    return features

train_set = docs2features(docrep[:numtrain], selected_features)
test_set = docs2features(docrep[numtrain:], selected_features)

In [None]:
print(test_set[0])

Let's check the results again ...

In [None]:
classifier = train_and_evaluate(train_set, test_set)

In [None]:
classifier.show_most_informative_features(5)

## Exercise
Compute the performance of TextBlob using the same test set.

<!--
y=[]
y_pred=[]
for fn in docnames[numtrain:]:
    y.append(fn.split('/')[0])
    if TextBlob(mr.raw(fn)).sentiment.polarity >= 0:
        y_pred.append("pos")
    else:
        y_pred.append("neg")

from sklearn import metrics
print("Accuracy: ", metrics.accuracy_score(y, y_pred))
-->