# Introduction
Several libraries can already classify a piece of text by sentiment. One of them is `textblob`.

In [None]:
from textblob import TextBlob
from pprint import pprint

In [None]:
texts=["The movie was good.", 
    "The movie was not good.",
    "I really think this product sucks.",
    "Really great product.",
    "I don't like this product"]
for t in texts:
    print(t, "==>", TextBlob(t).sentiment.polarity)

The previous code assumes that the text is already split into sentences, which may not be the case for text coming from sources such as *web pages* or *blogs*. An alternative is to give the whole text to `textblob` and let it find the sentences, as follows.

In [None]:
text=TextBlob("""The movie was good. The movie was not good. I really think this product sucks.
Really great product. I don't like this product""")

In [None]:
for s in text.sentences:
    print("=>", s)

In [None]:
for s in text.sentences:
    print(s, "==> ", s.sentiment.polarity)
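The `polarity` value is a float in the range [-1.0, 1.0], so turning it into a class label needs a decision threshold. The following is a minimal sketch, assuming a threshold of 0.0 and the label names `pos` and `neg`; the helper `polarity_to_label` and the example scores are hypothetical and are not TextBlob output.

```python
def polarity_to_label(polarity, threshold=0.0):
    """Map a polarity score in [-1.0, 1.0] to a "pos" or "neg" label."""
    return "pos" if polarity >= threshold else "neg"

# Hand-written example scores, not actual TextBlob output.
example_scores = [0.7, -0.35, 0.0, -1.0]
labels = [polarity_to_label(p) for p in example_scores]
print(labels)
```

Note that a score of exactly 0.0 is counted as positive here; that is a design choice, and a different threshold may work better on a given dataset.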

# Creating our own classifier
Let's use the [Sentiment Polarity Dataset 2.0](https://www.cs.cornell.edu/people/pabo/movie-review-data/), included in the `NLTK` library.
It consists of 1000 positive and 1000 negative processed reviews. It was introduced in Pang/Lee ACL 2004 and released in June 2004.

In [None]:
import nltk
from nltk.corpus import stopwords
from collections import defaultdict
from nltk import word_tokenize
import string

In [None]:
from nltk.corpus import movie_reviews as mr
print("The corpus contains %d reviews"% len(mr.fileids()))

In [None]:
for i in mr.fileids()[995:1005]: # file ids at indices 995 to 1004
    print(i, "==>", i.split('/')[0])

Let's see the content of one of these reviews

In [None]:
print(mr.raw(mr.fileids()[995]))

## Counting manually

Calculating the frequency of each word in the document ...

In [None]:
from nltk.probability import FreqDist
FreqDist(mr.raw(mr.fileids()[1]).split())

The previous code has flaws, because `split()` is a very basic way of finding words: it only splits on whitespace, so punctuation stays attached to the neighbouring word. Let's use `word_tokenize()` or `mr.words()` instead, and take a look at the most frequent words in the corpus...
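As a small illustration of the problem, the sketch below compares `split()` with a rough regular-expression tokenizer. The regular expression is an assumption made for this example only; it is not how `word_tokenize()` or `mr.words()` work internally.

```python
import re

sentence = "The movie wasn't good, was it?"

# Whitespace splitting leaves punctuation glued to the words.
split_tokens = sentence.split()
print(split_tokens)

# A rough regex tokenizer (for illustration only): runs of word characters
# and apostrophes, or single punctuation marks.
regex_tokens = re.findall(r"[\w']+|[^\w\s]", sentence)
print(regex_tokens)
```

With `split()`, `good,` and `good` would be counted as two different words in a frequency distribution.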

In [None]:
wordfreq = FreqDist()
for i in mr.fileids():
    wordfreq += FreqDist(w.lower() for w in mr.words(i))
print(wordfreq)
pprint(wordfreq.most_common(10))

Stop words and punctuation dominate the list, so let's remove them...

In [None]:
stopw = set(stopwords.words('english'))  # a set makes the membership test fast
wordfreq = FreqDist()
for i in mr.fileids():
    wordfreq += FreqDist(w.lower() for w in mr.words(i) if w.lower() not in stopw and w.lower() not in string.punctuation)
print(wordfreq)
pprint(wordfreq.most_common(10))
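One caveat about the filter in the cell above: `w not in string.punctuation` is a substring test against a single string, so multi-character tokens such as `--` or `...` are kept if the tokenizer produces them. The sketch below shows a character-wise alternative; the helper `is_punctuation` and the token list are hypothetical.

```python
import string

def is_punctuation(token):
    """True if the token is non-empty and made up only of punctuation characters."""
    return bool(token) and all(ch in string.punctuation for ch in token)

tokens = ["good", "--", "...", "?", "film"]

# Substring test, as in the cell above: "--" and "..." are not substrings
# of string.punctuation, so they are kept.
kept_substring = [t for t in tokens if t not in string.punctuation]

# Character-wise test: every punctuation-only token is dropped.
kept_charwise = [t for t in tokens if not is_punctuation(t)]

print(kept_substring)
print(kept_charwise)
```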

## Shuffling

Let's shuffle the documents, otherwise they will remain sorted by category: ["neg", "neg", ..., "pos"]

In [None]:
import random
docnames = list(mr.fileids())  # work on a copy, so the shuffle does not touch the corpus reader's own list
random.shuffle(docnames)
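Since the shuffle is random, the accuracy values obtained later will change from run to run. If reproducible results are wanted, one option is to shuffle with a seeded generator, as in this minimal sketch. The file names and the seed value 42 are assumptions for illustration.

```python
import random

# Hypothetical file names that mimic the "category/name" layout of the corpus.
names = ["neg/a.txt", "neg/b.txt", "pos/c.txt", "pos/d.txt"]

shuffled = list(names)                # work on a copy, leave the original order intact
random.Random(42).shuffle(shuffled)   # a private generator with a fixed seed

print(shuffled)
```

Using `random.Random(seed)` instead of `random.seed(seed)` keeps the global random state untouched, which may matter if other cells also draw random numbers.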

Let's split each document into words...

In [None]:
documents=[]
for i in docnames:
    y = i.split('/')[0]
    documents.append( ( mr.words(i) , y) )

Let's take a look at our documents...

In [None]:
for docs in documents[:5]:
    print(docs)

## Document representation

Now, let's produce the final document representation, in the form of a frequency distribution...

First, without stop words and punctuation... (you could use another technique, such as IDF weighting)

In [None]:
stopw = set(stopwords.words('english'))  # a set makes the membership test fast
docrep=[]
for words,tag in documents:
    features = FreqDist(w for w in words if w.lower() not in stopw and w.lower() not in string.punctuation)
    docrep.append( (features, tag) )

Let's take a look at our documents again...

In [None]:
for doc in docrep[:5]:
    print(doc)
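The IDF alternative mentioned above can be sketched with the standard library alone. This is a minimal illustration on hand-written toy documents, using the plain variant idf(t) = log(N / df(t)); other variants add smoothing, and none of the names below come from the notebook.

```python
import math
from collections import Counter

# Toy tokenized documents, hand-written for illustration.
docs = [["great", "movie"], ["bad", "movie"], ["great", "plot", "great", "cast"]]

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

# One common IDF variant: log(N / df).
n_docs = len(docs)
idf = {term: math.log(n_docs / count) for term, count in df.items()}

# TF-IDF weights for the third document: term frequency times IDF.
tf = Counter(docs[2])
tfidf = {term: freq * idf[term] for term, freq in tf.items()}
print(tfidf)
```

Terms that occur in many documents get a low weight, which has an effect similar to removing stop words, but without a fixed list.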

## NLTK classifier: Naive Bayes

Defining our training and test sets...

In [None]:
numtrain = int(len(documents) * 80 / 100)  # number of training documents

In [None]:
train_set, test_set = docrep[:numtrain], docrep[numtrain:]

In [None]:
print(test_set[0])
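Because the documents were shuffled before the split, the two classes are not guaranteed to be equally represented in each part. A quick way to check is to count the tags on each side, as in the sketch below. It uses a toy list standing in for `docrep`; all names ending in `toy` or starting with `toy_` are hypothetical.

```python
from collections import Counter

# Toy (features, tag) pairs standing in for `docrep`.
toy_docrep = [({"good": 1}, "pos"), ({"bad": 2}, "neg"), ({"fine": 1}, "pos"),
              ({"dull": 1}, "neg"), ({"great": 3}, "pos")]

# Same 80/20 split as above.
numtrain_toy = int(len(toy_docrep) * 80 / 100)
toy_train, toy_test = toy_docrep[:numtrain_toy], toy_docrep[numtrain_toy:]

train_counts = Counter(tag for _, tag in toy_train)
test_counts = Counter(tag for _, tag in toy_test)
print(train_counts)
print(test_counts)
```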

In [None]:
from nltk.classify import NaiveBayesClassifier as nbc

In [None]:
classifier = nbc.train(train_set)
print("Accuracy:", nltk.classify.accuracy(classifier, test_set))

Another way to compute the accuracy...

In [None]:
from nltk.metrics import scores
test_ref = [tag for doc,tag in test_set]
test_pred = classifier.classify_many([doc for doc,tag in test_set])
print("Accuracy:", scores.accuracy(test_ref, test_pred) )  # reference labels first, predictions second
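To make explicit what the accuracy measures, here is a minimal sketch that computes it by hand, together with the counts of each (reference, predicted) pair. The two label lists are hand-written toy data, not results from the classifier above.

```python
from collections import Counter

# Toy reference and predicted labels, for illustration only.
reference = ["pos", "neg", "pos", "neg", "pos"]
predicted = ["pos", "pos", "pos", "neg", "neg"]

# Accuracy: fraction of positions where both labels agree.
correct = sum(r == p for r, p in zip(reference, predicted))
accuracy = correct / len(reference)
print("Accuracy:", accuracy)

# Counts per (reference, predicted) pair, i.e. a flat confusion matrix.
confusion = Counter(zip(reference, predicted))
print(confusion)
```

The pair counts show which kind of error dominates, something a single accuracy value hides.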

In [None]:
classifier.show_most_informative_features(5)

Now, with vocabulary selection ...

In [None]:
features_freq=FreqDist()
for wordsf, t in docrep:
    features_freq += wordsf
features_freq.most_common(10)

In [None]:
selected_features=[f for f,ntimes in features_freq.most_common(500)]

Keeping, for each document, only the words that belong to `selected_features`, together with their *frequency*... (after executing, go back, retrain the classifier and test the performance)

In [None]:
train_set = [({w:f for w,f in wordsf.items() if w in selected_features}, tag) for wordsf,tag in docrep[:numtrain]]
test_set = [({w:f for w,f in wordsf.items() if w in selected_features}, tag) for wordsf,tag in docrep[numtrain:]]

In [None]:
print(test_set[0])

For each one of the `selected_features`, use its frequency in each document, which is zero when the word does not occur... (after executing, go back, retrain the classifier and test the performance)

In [None]:
train_set = [({f:wordsf[f] for f in selected_features}, tag) for wordsf,tag in docrep[:numtrain]]
test_set = [({f:wordsf[f] for f in selected_features}, tag) for wordsf,tag in docrep[numtrain:]]

In [None]:
print(test_set[0])
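The two representations above differ in how they treat selected words that are absent from a document. The sketch below contrasts them on a toy example, using `collections.Counter` (the base class of `FreqDist`); the vocabulary and the document are hypothetical.

```python
from collections import Counter

selected = ["movie", "good", "bad"]   # hypothetical selected vocabulary
selected_set = set(selected)          # set membership is faster than a list scan
doc_counts = Counter(["good", "movie", "movie", "plot"])

# First variant: only selected words that occur in the document.
sparse = {w: f for w, f in doc_counts.items() if w in selected_set}

# Second variant: every selected word, with 0 for the absent ones.
dense = {f: doc_counts[f] for f in selected}

print(sparse)
print(dense)
```

In the second variant every document has the same set of feature names, so an absent word is an explicit feature value rather than a missing feature.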

Create a new model with this new representation and evaluate its performance.

In [None]:
classifier = nbc.train(train_set)
print("Accuracy:", nltk.classify.accuracy(classifier, test_set))

## Now with part-of-speech TAGS

In [None]:
nltk.pos_tag(nltk.word_tokenize("time flies like an arrow"))

In [None]:
nltk.pos_tag(["he", "flies"])

In [None]:
print(documents[0])

In [None]:
nltk.pos_tag(documents[0][0])[:10]

In [None]:
stopw = set(stopwords.words('english'))  # a set makes the membership test fast
docrep=[]
for words,tag in documents:
    features = FreqDist("%s_%s"%(w,p) for w,p in nltk.pos_tag(words) if w.lower() not in stopw and w.lower() not in string.punctuation)
    docrep.append( (features, tag) )

In [None]:
docrep[0]
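The `word_TAG` feature names built above let the classifier tell apart two uses of the same word. The sketch below shows the idea on hand-written (word, tag) pairs; the tags are assumptions chosen for illustration and are not the output of `nltk.pos_tag`.

```python
from collections import Counter

# Hand-written (word, tag) pairs; the tags are illustrative assumptions.
tagged = [("time", "NN"), ("flies", "VBZ"), ("like", "IN"), ("an", "DT"),
          ("arrow", "NN"), ("fruit", "NN"), ("flies", "NNS")]

# Same naming scheme as in the cell above: word and tag joined by "_".
pos_features = Counter("%s_%s" % (w, p) for w, p in tagged)
print(pos_features)
```

Here `flies_VBZ` and `flies_NNS` are counted as two distinct features, whereas a plain word count would merge them.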

In [None]:
posfeatures_freq=FreqDist()
for ff, t in docrep:
    posfeatures_freq += ff
posfeatures_freq.most_common(10)
selected_posfeatures=[f for f,freq in posfeatures_freq.most_common(500)]

In [None]:
train_set = [({f:ff[f] for f in selected_posfeatures}, tag) for ff,tag in docrep[:numtrain]]
test_set = [({f:ff[f] for f in selected_posfeatures}, tag) for ff,tag in docrep[numtrain:]]

In [None]:
print(test_set[0])

Let's check the results again...

In [None]:
classifier = nbc.train(train_set)
print("Accuracy:", nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)

## Exercise
Compute the performance of TextBlob using the same test set.

<!--
y=[]
y_pred=[]
for fn in docnames[numtrain:]:
    y.append(fn.split('/')[0])
    if TextBlob(mr.raw(fn)).sentiment.polarity >= 0:
        y_pred.append("pos")
    else:
        y_pred.append("neg")

from sklearn import metrics
print("Accuracy: ", metrics.accuracy_score(y, y_pred))
-->

## Now much faster, using some useful scikit-learn functions 


In [None]:
import sklearn
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

### Assuming that documents are shuffled
Go back to the shuffling section and make sure `docnames` contains a shuffled list of documents.

In [None]:
documents=[]
tags = []
for doc in docnames:
    documents.append(mr.raw(doc))
    tags.append( doc.split('/')[0])

In [None]:
for i in range(2):
    print(tags[i], documents[i])

In [None]:
numtrain = int(len(documents) * 80 / 100)  # number of training documents
train_documents, test_documents = documents[:numtrain], documents[numtrain:]
train_tags, test_tags = tags[:numtrain], tags[numtrain:]

In [None]:
vectorizer = CountVectorizer()
train_X = vectorizer.fit_transform(train_documents)
test_X = vectorizer.transform(test_documents)
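The difference between `fit_transform` and `transform` is that the vocabulary is learned from the training documents only, and words seen only at test time are ignored. The following standard-library sketch mimics that behaviour in a simplified way; it assumes lowercasing and tokens of two or more word characters, and is not the actual scikit-learn implementation.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words of two or more characters

def tokenize(text):
    """Lowercase the text and return its tokens."""
    return TOKEN.findall(text.lower())

train_docs = ["A good movie.", "A bad, bad movie!"]
test_docs = ["Good plot, bad acting"]

# "fit": the vocabulary is built from the training documents only.
vocabulary = sorted({tok for doc in train_docs for tok in tokenize(doc)})

def to_counts(doc):
    """"transform": count only the words that are in the vocabulary."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]

train_rows = [to_counts(d) for d in train_docs]
test_rows = [to_counts(d) for d in test_docs]
print(vocabulary)
print(train_rows)
print(test_rows)
```

The words `plot` and `acting` do not appear in the test row because they were never seen during fitting.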

In [None]:
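# get_feature_names() was removed from recent scikit-learn releases; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out()[:1000])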

In [None]:
print(train_X.shape, test_X.shape)

In [None]:
classifier = MultinomialNB()

In [None]:
classifier.fit(train_X, train_tags)

In [None]:
pred = classifier.predict(test_X)

In [None]:
from sklearn import metrics  # needed here: `metrics` was not imported in the cells above
score = metrics.accuracy_score(test_tags, pred)
print("accuracy:   %0.3f" % score)
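`TfidfVectorizer` and `classification_report` are imported above but not used. As a possible next step, the sketch below plugs both into the same workflow on a tiny hand-written dataset. All names starting with `toy_` and the example sentences are assumptions for illustration; on the real corpus, `train_documents`, `test_documents`, `train_tags` and `test_tags` would take their place.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB

# Tiny hand-written dataset, for illustration only.
toy_train_docs = ["a great and moving film", "great cast, great plot",
                  "a dull and boring film", "boring plot, awful cast"]
toy_train_tags = ["pos", "pos", "neg", "neg"]
toy_test_docs = ["a great film", "an awful, boring film"]
toy_test_tags = ["pos", "neg"]

# TF-IDF weights instead of raw counts; the vocabulary comes from the training data.
tfidf_vectorizer = TfidfVectorizer()
toy_train_X = tfidf_vectorizer.fit_transform(toy_train_docs)
toy_test_X = tfidf_vectorizer.transform(toy_test_docs)

toy_classifier = MultinomialNB().fit(toy_train_X, toy_train_tags)
toy_pred = toy_classifier.predict(toy_test_X)

# Precision, recall and F1 per class, in addition to the overall accuracy.
print(classification_report(toy_test_tags, toy_pred))
```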