# Creating our own classifier
Let's use the [Sentiment Polarity Dataset 2.0](https://www.cs.cornell.edu/people/pabo/movie-review-data/), included in the `NLTK` library.
It consists of 1000 positive and 1000 negative processed reviews. It was introduced in Pang/Lee ACL 2004 and released in June 2004.

In [44]:
import nltk
from nltk.corpus import stopwords
from collections import defaultdict
from nltk import word_tokenize
import string

In [45]:
from nltk.corpus import movie_reviews as mr
print("The corpus contains %d reviews"% len(mr.fileids()))

The corpus contains 2000 reviews


In [46]:
for i in mr.fileids()[995:1005]: # Reviews 995 to 1004 (the slice end is exclusive)
    print(i, "==>", i.split('/')[0])

neg/cv461_21124.txt ==> neg
neg/cv187_14112.txt ==> neg
pos/cv137_15422.txt ==> pos
pos/cv544_5108.txt ==> pos
pos/cv464_15650.txt ==> pos
pos/cv284_19119.txt ==> pos
neg/cv458_9000.txt ==> neg
pos/cv399_2877.txt ==> pos
neg/cv721_28993.txt ==> neg
pos/cv502_10406.txt ==> pos


Let's see the content of one of these reviews

In [47]:
print(mr.raw(mr.fileids()[995]))

the tagline on random hearts reads " in a perfect world , they never would have met . " 
in a perfect world , i never would have seen this movie . 
the biggest flaw is that 20 minutes into this film , kay chandler ( kristin scott thomas ) and dutch van den broeck ( harrison ford ) are the only two major characters alive ; resulting in little doubt that they will end up together at some point during the laborious two-hours-and-then-some production . 
dutch is a sergeant in internal affairs at the district of columbia police department . 
kay is a congresswoman from new hampshire . 
although they both think they are happily married , their spouses are cheating with each other behind their backs . 
dutch and kay are soon widowed when a plane goes down carrying their partners ; they subsequently discover the affair . 
the rest of the film is the pointless , unrealistic and often-times boring story of their researching the sexual relationship that they were blind to , and getting to know ea

### Checking which are the most frequent words

Calculating the frequency of each word in the document ...

In [48]:
from nltk.probability import FreqDist
FreqDist(mr.raw(mr.fileids()[1]).split())

FreqDist({',': 69, 'the': 40, 'to': 27, '.': 24, '"': 22, 'a': 21, 'and': 20, 'of': 18, 'i': 14, 'in': 14, ...})

Let's take a look at the most frequent words in the whole corpus.

The previous code is flawed because `split()` only breaks on whitespace, which is a very basic way of finding words. Let's use `word_tokenize()` instead (`mr.words()` is another option)...

In [49]:
wordfreq = FreqDist()
for i in mr.fileids():
    wordfreq += FreqDist(w.lower() for w in word_tokenize(mr.raw(i)))
print(wordfreq)
print(wordfreq.most_common(30))

<FreqDist with 46462 samples and 1525039 outcomes>
[(',', 77717), ('the', 76276), ('.', 65876), ('a', 37995), ('and', 35404), ('of', 33972), ('to', 31772), ('is', 26054), ('in', 21611), ("'s", 18128), ('``', 17625), ('it', 16059), ('that', 15912), (')', 11781), ('(', 11664), ('as', 11349), ('with', 10782), ('for', 9918), ('this', 9573), ('his', 9569), ('film', 9443), ('i', 8850), ('he', 8840), ('but', 8604), ('on', 7249), ('are', 7204), ('by', 6218), ("n't", 6217), ('be', 6083), ('an', 5742)]
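
To make the difference between whitespace splitting and tokenization concrete, here is a small sketch that needs only the standard library. The regular expression is an illustrative stand-in and is not how `word_tokenize()` works internally; the sample sentence is made up.

```python
import re
from collections import Counter

# Made-up sample written as ordinary (untokenized) prose
text = "Great film. Really great film, isn't it?"

# Whitespace splitting leaves punctuation glued to the words
split_tokens = text.lower().split()

# Illustrative regex: runs of word characters/apostrophes, or single punctuation marks
regex_tokens = re.findall(r"[\w']+|[^\w\s]", text.lower())

# With split(), 'film.' and 'film,' are two different "words" and 'film' never appears
print(Counter(split_tokens)["film"], Counter(regex_tokens)["film"])
```

With `split()`, the two occurrences of "film" are counted under two different keys, so frequency counts get fragmented.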


Stop words and punctuation dominate the list, so let's remove them...

In [50]:
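stopw = set(stopwords.words('english'))  # a set makes membership tests fast
wordfreq = FreqDist()
for i in mr.fileids():
    # `in string.punctuation` is a substring test: it drops single marks such as ','
    # but keeps multi-character tokens such as '``' and '--'
    wordfreq += FreqDist(w.lower() for w in word_tokenize(mr.raw(i)) 
        if w.lower() not in stopw and w.lower() not in string.punctuation)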
print(wordfreq)
print(wordfreq.most_common(30))

<FreqDist with 46290 samples and 746109 outcomes>
[("'s", 18128), ('``', 17625), ('film', 9443), ("n't", 6217), ('movie', 5671), ('one', 5582), ('like', 3547), ('even', 2556), ('good', 2316), ('time', 2282), ('would', 2264), ('story', 2146), ('--', 2055), ('much', 2024), ('character', 1996), ('also', 1965), ('get', 1925), ('characters', 1858), ('two', 1827), ('first', 1769), ('see', 1731), ('way', 1669), ('well', 1656), ('could', 1609), ('make', 1593), ('really', 1556), ('films', 1520), ('little', 1490), ('life', 1483), ('plot', 1460)]
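
The double-backtick and `--` tokens survive the filter because `w not in string.punctuation` is a substring test against one long string of punctuation characters. One possible alternative, sketched here on a hand-picked token list, is to drop tokens that consist entirely of punctuation characters. This is only a suggestion and is not what the cells in this notebook do.

```python
import string

# Hand-picked tokens of the kind seen in the output above
tokens = ["``", "film", "--", ",", "'s", "n't", "good"]

# Substring test, as used in the notebook: only ',' is removed
substring_filtered = [t for t in tokens if t not in string.punctuation]

# Character-wise test: remove tokens made up only of punctuation characters
punct = set(string.punctuation)
charwise_filtered = [t for t in tokens if not all(c in punct for c in t)]

print(substring_filtered)
print(charwise_filtered)
```

Tokens such as `'s` and `n't` contain letters, so they pass both filters; removing them would need an extra rule.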


## Shuffling

Let's shuffle the documents; otherwise they remain sorted by label ["neg", "neg" ... "pos"], and an 80/20 split of a sorted list would leave only "pos" reviews in the test set.

In [51]:
import random

Let's read each document's raw text together with its label ...

In [52]:
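docnames = list(mr.fileids())  # copy, so the shuffle does not reorder the reader's own list
random.shuffle(docnames)
documents = []
for i in docnames:
    y = i.split('/')[0]  # the label is the folder name: 'pos' or 'neg'
    documents.append( (mr.raw(i), y) )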
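
Because the shuffle is random, the accuracy figures below change from run to run. A minimal sketch of a reproducible shuffle follows, assuming a fixed seed is acceptable; the helper `shuffled_copy` and the file names are hypothetical and not part of the notebook.

```python
import random

# Hypothetical file names standing in for mr.fileids()
names = ["neg/a.txt", "neg/b.txt", "pos/c.txt", "pos/d.txt"]

def shuffled_copy(items, seed):
    """Return a shuffled copy of items; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, global random state untouched
    copy = list(items)         # leave the caller's list in its original order
    rng.shuffle(copy)
    return copy

first = shuffled_copy(names, seed=42)
second = shuffled_copy(names, seed=42)
print(first == second)
```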

Let's take a look at our documents...

In [53]:
for docs in documents[0:2]:
    print(docs)

('in 1995 , brian singer and christopher mcquarrie dreamed up a simple concept : the audience isn\'t stupid . \nfrom that , they went on and created the most plot-driven , intricately pieced movie in the last 25 years . \nthe result : the usual suspects , one hell of a movie that redefines the word plot twist . \nthe story is convoluted , and is really confusing to read , although easy to follow on screen . \nspecial investigator kujan ( chazz palminteri ) grills " verbal " kint ( kevin spacey ) , a crippled con-man who is the lone survivor of an la boat explosion that claimed more than 20 victims . \nkujan wants to confirm that his nemesis , the rogue cop keaton ( gabriel byrne ) , is actually dead . \nkint relates the majority of the film in flashback , beginning with the fateful day when five shifty guys meet in a police-station lineup in new york city . \nalong with dour keaton , kint encounters cheerfully sociopathic mcmanus ( stephen baldwin ) , mordantly sarcastic hockney ( kevi

## Document representation

Now, let's produce the final document representation, in the form of a frequency distribution ...

First, without stop words and punctuation ... (you could use another technique, such as IDF)

In [54]:
stopw = set(stopwords.words('english'))  # a set makes membership tests fast
docrep=[]
for text,tag in documents:
    features = FreqDist(w for w in word_tokenize(text) 
                        if w.lower() not in stopw and w.lower() not in string.punctuation)
    docrep.append( (features, tag) )

Let's take a look at our documents again...

In [55]:
for doc in docrep[:5]:
    print(doc)

(FreqDist({'movie': 6, 'film': 6, 'one': 5, 'audience': 4, '``': 4, 'kint': 4, 'spacey': 4, 'character': 4, 'singer': 3, 'mcquarrie': 3, ...}), 'pos')
(FreqDist({'film': 5, 'jack': 4, "'s": 3, 'emotion': 3, 'rose': 3, 'titanic': 2, 'made': 2, 'movie': 2, 'paxton': 2, 'ship': 2, ...}), 'pos')
(FreqDist({'``': 10, 'trekkies': 9, 'star': 6, 'trek': 6, 'conventions': 5, 'would': 5, 'time': 5, 'film': 4, 'one': 4, "'s": 3, ...}), 'pos')
(FreqDist({'doctor': 11, "'s": 7, '``': 6, 'would': 6, 'master': 5, 'movie': 5, 'time': 4, 'series': 4, 'daleks': 4, 'good': 4, ...}), 'pos')
(FreqDist({'``': 42, "'s": 17, 'like': 16, 'eva': 14, 'musical': 13, 'film': 7, 'musicals': 6, "n't": 6, 'juan': 6, 'good': 6, ...}), 'pos')
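
The IDF alternative mentioned above can be sketched with the standard library alone. This is a minimal illustration on three made-up documents, using the plain `log(N / df)` form of IDF; other variants exist, and the notebook itself keeps raw counts.

```python
import math
from collections import Counter

# Three tiny made-up documents, already tokenized
docs = [
    ["film", "great", "plot"],
    ["film", "boring", "plot"],
    ["film", "great", "acting"],
]

n_docs = len(docs)
# Document frequency: in how many documents each word appears
df = Counter(w for doc in docs for w in set(doc))
# IDF = log(N / df); a word present in every document gets weight 0
idf = {w: math.log(n_docs / count) for w, count in df.items()}

def tfidf(doc):
    """Weight each word's in-document count by its IDF."""
    tf = Counter(doc)
    return {w: tf[w] * idf[w] for w in tf}

print(tfidf(docs[1]))
```

A word like "film", which occurs in every document, ends up with weight 0, so it is discounted without needing a stop word list.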


## NLTK classifier: Naive Bayes

Defining our training and test sets...

In [56]:
numtrain = int(len(docrep) * 80 / 100)  # number of training documents

In [57]:
train_set, test_set = docrep[:numtrain], docrep[numtrain:]

In [58]:
print(test_set[0])

(FreqDist({'niagara': 8, "'s": 7, 'marcy': 4, 'tunney': 4, '--': 3, "n't": 3, 'know': 3, 'thomas': 2, 'couple': 2, 'journey': 2, ...}), 'pos')
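
Before training, it can be worth checking that both splits contain both classes. The sketch below uses made-up labels in place of the real tags and applies the same 80/20 rule; it also shows what the test split would look like without shuffling.

```python
from collections import Counter

# Made-up labels standing in for the tags of the shuffled documents
labels = ["pos", "neg", "neg", "pos", "pos", "neg", "pos", "neg", "neg", "pos"]

numtrain = int(len(labels) * 80 / 100)  # same 80/20 rule as above
train_labels, test_labels = labels[:numtrain], labels[numtrain:]

print(Counter(train_labels))
print(Counter(test_labels))

# Without shuffling, a label-sorted list leaves a single class in the test split
sorted_test = sorted(labels)[numtrain:]
print(Counter(sorted_test))
```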


In [59]:
from nltk.classify import NaiveBayesClassifier as nbc

In [60]:
classifier = nbc.train(train_set)
print("Accuracy: {:.3f}".format( nltk.classify.accuracy(classifier, test_set)))

Accuracy: 0.740


Another (more general) way of evaluating the accuracy...

In [61]:
from nltk.metrics import scores
test_ref = [tag for doc,tag in test_set]
test_pred = classifier.classify_many([doc for doc,tag in test_set])
print("Accuracy: {:.3f}".format(scores.accuracy(test_ref, test_pred) ) )  # reference first, then predictions

Accuracy: 0.740
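
Having the reference and predicted labels as two lists is what makes this approach more general: other measures can be computed from the same pair. A stdlib-only sketch on made-up labels follows; the function names are illustrative, not NLTK's.

```python
from collections import Counter

# Made-up reference and predicted labels
test_ref  = ["pos", "pos", "neg", "neg", "pos", "neg"]
test_pred = ["pos", "neg", "neg", "pos", "pos", "neg"]

def accuracy(reference, predicted):
    """Fraction of positions where the prediction matches the reference."""
    hits = sum(r == p for r, p in zip(reference, predicted))
    return hits / len(reference)

# Count (reference, predicted) pairs: a confusion matrix stored as a Counter
confusion = Counter(zip(test_ref, test_pred))

def recall(label):
    """Of the documents that truly carry this label, the share predicted correctly."""
    total = sum(n for (ref, _), n in confusion.items() if ref == label)
    return confusion[(label, label)] / total

print("Accuracy: {:.3f}".format(accuracy(test_ref, test_pred)))
print("Recall (pos): {:.3f}".format(recall("pos")))
```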


In [62]:
classifier.show_most_informative_features(5)

Most Informative Features
               ludicrous = 1                 neg : pos    =     20.7 : 1.0
                   great = 4                 pos : neg    =     16.5 : 1.0
               marvelous = 1                 pos : neg    =     13.8 : 1.0
               wonderful = 2                 pos : neg    =     13.2 : 1.0
                  avoids = 1                 pos : neg    =     12.5 : 1.0
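
Each row above compares how likely a given feature value is under one class versus the other, so `great = 4` refers to documents in which "great" occurs exactly four times. The sketch below shows how such a ratio could arise from counts. All numbers are hypothetical, and the add-0.5 smoothing is an assumption about the estimator rather than something stated in this notebook.

```python
def smoothed_prob(count, total, bins, gamma=0.5):
    """Additive smoothing: (count + gamma) / (total + gamma * bins)."""
    return (count + gamma) / (total + gamma * bins)

# Hypothetical numbers: training documents per class, and how many of them
# have the feature 'great' equal to 4
n_pos, n_neg = 800, 800
great4_pos, great4_neg = 8, 0
bins = 5  # assumed number of distinct values observed for this feature

p_pos = smoothed_prob(great4_pos, n_pos, bins)
p_neg = smoothed_prob(great4_neg, n_neg, bins)
ratio = p_pos / p_neg

print("pos : neg = {:.1f} : 1.0".format(ratio))
```

Smoothing is what keeps the ratio finite when a value never occurs in one of the classes. It also means that large ratios can rest on very few documents.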


To simplify the next steps, let's create a procedure that trains the classifier and evaluates it right away....

In [63]:
from nltk.classify import NaiveBayesClassifier as nbc
from nltk.metrics import scores

def train_and_evaluate(train_set, test_set):
    classifier = nbc.train(train_set)
    test_ref = [tag for doc,tag in test_set]
    test_pred = classifier.classify_many([doc for doc,tag in test_set])
    print("Accuracy: {:.3f}".format(scores.accuracy(test_ref, test_pred) ) )  # reference first, then predictions
    return classifier

In [64]:
classifier = train_and_evaluate(train_set, test_set)

Accuracy: 0.740


### Now, let's select only the most frequent words ...

In [65]:
feature_counts=FreqDist()
for doc_feature_counts, _ in train_set:
    feature_counts += doc_feature_counts
print(feature_counts.most_common(30))

[("'s", 14488), ('``', 14382), ('film', 7454), ("n't", 5006), ('movie', 4536), ('one', 4449), ('like', 2812), ('even', 2031), ('time', 1852), ('would', 1814), ('good', 1797), ('story', 1688), ('--', 1680), ('much', 1613), ('character', 1586), ('also', 1541), ('get', 1516), ('two', 1468), ('characters', 1459), ('first', 1407), ('see', 1374), ('well', 1336), ('way', 1335), ('could', 1274), ('make', 1254), ('really', 1231), ('films', 1209), ('plot', 1182), ('little', 1173), ('life', 1154)]


In [66]:
selected_features=[f for f,ntimes in feature_counts.most_common(1000)]
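
The selection step can be tried on toy data with `collections.Counter`, used here as a stand-in for `FreqDist`. The counts below are made up, and `top_k` plays the role of the 1000 used above.

```python
from collections import Counter

# Made-up per-document counts standing in for the FreqDist objects in train_set
doc_counts = [
    Counter({"film": 3, "great": 2, "plot": 1}),
    Counter({"film": 2, "boring": 1}),
    Counter({"film": 1, "great": 1, "acting": 1}),
]

total = Counter()
for counts in doc_counts:
    total.update(counts)  # adds the counts in place

top_k = 2
selected = [word for word, _ in total.most_common(top_k)]
print(selected)
```

Selecting by raw frequency favours words that are common in both classes, which is why tokens like "film" rank so high.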

First, keep only the selected words that actually occur in each document, with their *frequency* as the value... (after executing, test the performance)

In [67]:
def docs2features(docs, selected_features):
    selected = set(selected_features)  # set lookup instead of scanning a list
    features = []
    for doc_feature_counts, tag in docs:
        features.append(({w:f for w,f in doc_feature_counts.items() if w in selected}, tag))
    return features

train_set = docs2features(docrep[:numtrain], selected_features)
test_set = docs2features(docrep[numtrain:], selected_features)

In [68]:
print(test_set[0])

({'r': 1, 'bob': 1, "'s": 7, 'follows': 1, 'unlike': 1, 'lot': 1, 'movies': 1, 'wild': 1, 'robin': 1, 'meet': 1, 'running': 1, 'local': 1, 'couple': 2, 'scenes': 1, 'later': 1, 'two': 1, 'small': 1, 'american': 1, 'town': 1, 'wants': 1, 'along': 2, 'way': 2, 'true': 1, 'love': 1, 'sets': 1, 'apart': 1, 'though': 1, 'acting': 1, 'thriller': 1, 'delivers': 1, 'performance': 1, 'best': 1, 'actress': 1, 'last': 1, 'year': 1, 'film': 1, 'work': 1, 'makes': 1, '--': 3, 'often': 1, 'act': 1, 'much': 1, 'convincing': 1, 'subtle': 1, 'chemistry': 1, 'mostly': 1, 'writer': 1, 'involving': 1, 'michael': 2, 'takes': 1, 'brings': 1, 'story': 1, 'key': 1, 'character': 1, 'highly': 1, 'power': 1, 'turn': 1, 'level': 1, 'would': 1, 'otherwise': 1, 'opens': 1, '``': 2, "n't": 3, 'know': 3, 'expect': 1, 'like': 1, 'something': 1, 'chase': 1, 'long': 1, 'get': 1, 'still': 1, 'first': 1}, 'pos')


In [69]:
classifier = train_and_evaluate(train_set, test_set)

Accuracy: 0.672


Next, for each one of the *selected_features*, use its frequency in each document, including 0 when the word is absent... (after executing, go back and test the performance)

In [70]:
def docs2features(docs, selected_features):
    features = []
    for doc_feature_counts, tag in docs:
        features.append(({f:doc_feature_counts[f] for f in selected_features}, tag))
    return features

train_set = docs2features(docrep[:numtrain], selected_features)
test_set = docs2features(docrep[numtrain:], selected_features)

In [71]:
print(test_set[0])

({"'s": 7, '``': 2, 'film': 1, "n't": 3, 'movie': 0, 'one': 0, 'like': 1, 'even': 0, 'time': 0, 'would': 1, 'good': 0, 'story': 1, '--': 3, 'much': 1, 'character': 1, 'also': 0, 'get': 1, 'two': 1, 'characters': 0, 'first': 1, 'see': 0, 'well': 0, 'way': 2, 'could': 0, 'make': 0, 'really': 0, 'films': 0, 'plot': 0, 'little': 0, 'life': 0, 'people': 0, 'bad': 0, 'never': 0, 'scene': 0, 'man': 0, 'best': 1, 'new': 0, 'many': 0, 'know': 3, 'scenes': 1, 'movies': 1, 'another': 0, 'great': 0, 'director': 0, 'go': 0, 'end': 0, 'love': 1, 'us': 0, 'action': 0, 'something': 1, 'seems': 0, 'made': 0, 'world': 0, 'back': 0, 'still': 1, 'big': 0, 'however': 0, 'work': 1, 'makes': 1, "'re": 0, 'every': 0, 'better': 0, 'though': 1, 'seen': 0, 'audience': 0, 'enough': 0, 'take': 0, 'going': 0, 'around': 0, 'things': 0, 'gets': 0, 'may': 0, 'performance': 1, 'real': 0, 'role': 0, 'thing': 0, 'think': 0, 'last': 1, 'look': 0, "'ve": 0, 'nothing': 0, 'john': 0, 'years': 0, 'actually': 0, 'funny': 0, 'r

In [72]:
classifier = train_and_evaluate(train_set, test_set)

Accuracy: 0.767
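
The two `docs2features` variants differ only in how they treat selected words that are absent from a document. The sketch below shows both on one made-up document, with `Counter` standing in for `FreqDist`. One plausible reading of the gap between 0.672 and 0.767 is that an explicit 0 gives the classifier an observed value to learn from, whereas a missing key does not.

```python
from collections import Counter

selected_features = ["film", "great", "boring"]     # made-up vocabulary
doc = Counter(["film", "film", "great", "spacey"])  # made-up document

# Variant 1: only the selected words that occur in the document
sparse = {w: f for w, f in doc.items() if w in selected_features}

# Variant 2: every selected word, with 0 for the absent ones
dense = {f: doc[f] for f in selected_features}

print(sparse)
print(dense)
```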


## Now with part-of-speech tags

In [73]:
nltk.pos_tag(nltk.word_tokenize("I like when he flies like the flies"))

[('I', 'PRP'),
 ('like', 'VBP'),
 ('when', 'WRB'),
 ('he', 'PRP'),
 ('flies', 'VBZ'),
 ('like', 'IN'),
 ('the', 'DT'),
 ('flies', 'NNS')]

In [74]:
print(documents[0])

('in 1995 , brian singer and christopher mcquarrie dreamed up a simple concept : the audience isn\'t stupid . \nfrom that , they went on and created the most plot-driven , intricately pieced movie in the last 25 years . \nthe result : the usual suspects , one hell of a movie that redefines the word plot twist . \nthe story is convoluted , and is really confusing to read , although easy to follow on screen . \nspecial investigator kujan ( chazz palminteri ) grills " verbal " kint ( kevin spacey ) , a crippled con-man who is the lone survivor of an la boat explosion that claimed more than 20 victims . \nkujan wants to confirm that his nemesis , the rogue cop keaton ( gabriel byrne ) , is actually dead . \nkint relates the majority of the film in flashback , beginning with the fateful day when five shifty guys meet in a police-station lineup in new york city . \nalong with dour keaton , kint encounters cheerfully sociopathic mcmanus ( stephen baldwin ) , mordantly sarcastic hockney ( kevi

In [75]:
nltk.pos_tag(word_tokenize(documents[0][0]))[:10]

[('in', 'IN'),
 ('1995', 'CD'),
 (',', ','),
 ('brian', 'JJ'),
 ('singer', 'NN'),
 ('and', 'CC'),
 ('christopher', 'NN'),
 ('mcquarrie', 'NN'),
 ('dreamed', 'VBD'),
 ('up', 'RP')]

In [76]:
stopw = set(stopwords.words('english'))  # a set makes membership tests fast
docrep=[]
for text,tag in documents:
    features = FreqDist("%s_%s"%(w,p) for w,p in nltk.pos_tag(word_tokenize(text)) 
                        if w.lower() not in stopw and w.lower() not in string.punctuation)
    docrep.append( (features, tag) )

In [77]:
docrep[0]

(FreqDist({'movie_NN': 6, 'film_NN': 6, 'one_CD': 5, 'audience_NN': 4, '``_``': 4, 'kint_NN': 4, 'spacey_NN': 4, 'character_NN': 4, "n't_RB": 3, 'last_JJ': 3, ...}),
 'pos')
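
Joining each word with its tag means the same surface form can yield several distinct features. The sketch below reuses the tagged sentence from the `pos_tag` example earlier in this section, hardcoded so that it runs without the tagger, and uses `Counter` as a stand-in for `FreqDist`.

```python
from collections import Counter

# Tagged tokens copied from the pos_tag output shown earlier
tagged = [("I", "PRP"), ("like", "VBP"), ("when", "WRB"), ("he", "PRP"),
          ("flies", "VBZ"), ("like", "IN"), ("the", "DT"), ("flies", "NNS")]

# Plain word counts: both uses of 'like' fall under one key
plain = Counter(w for w, _ in tagged)

# word_TAG counts: the verb and the preposition become separate features
with_pos = Counter("%s_%s" % (w, p) for w, p in tagged)

print(plain["like"], with_pos["like_VBP"], with_pos["like_IN"])
```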

In [78]:
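feature_counts=FreqDist()
# Note: counts here cover all of docrep, test documents included;
# iterating over docrep[:numtrain] would select features from the training data only
for doc_feature_counts, t in docrep:
    feature_counts += doc_feature_counts
feature_counts.most_common(10)  # not displayed: a cell only echoes its last expression
selected_features=[f for f,freq in feature_counts.most_common(1000)]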

In [79]:
def docs2features(docs, selected_features):
    features = []
    for doc_feature_counts, tag in docs:
        features.append(({f:doc_feature_counts[f] for f in selected_features}, tag))
    return features

train_set = docs2features(docrep[:numtrain], selected_features)
test_set = docs2features(docrep[numtrain:], selected_features)

In [80]:
print(test_set[0])

({'``_``': 2, "'s_POS": 6, 'film_NN': 1, "'s_VBZ": 1, "n't_RB": 3, 'movie_NN': 0, 'one_CD': 0, 'like_IN': 1, 'even_RB': 0, 'good_JJ': 0, 'time_NN': 0, 'would_MD': 1, 'story_NN': 1, '--_:': 3, 'character_NN': 1, 'also_RB': 0, 'characters_NNS': 0, 'two_CD': 1, 'way_NN': 2, 'could_MD': 0, 'well_RB': 0, 'really_RB': 0, 'first_JJ': 1, 'films_NNS': 0, 'life_NN': 0, 'people_NNS': 0, 'plot_NN': 0, 'get_VB': 0, 'bad_JJ': 0, 'scene_NN': 0, 'never_RB': 0, 'little_JJ': 0, 'man_NN': 0, 'make_VB': 0, 'see_VB': 0, 'new_JJ': 0, 'many_JJ': 0, 'scenes_NNS': 1, 'much_JJ': 0, 'movies_NNS': 1, 'best_JJS': 1, 'great_JJ': 0, 'another_DT': 0, 'director_NN': 0, 'action_NN': 0, 'us_PRP': 0, 'something_NN': 1, 'still_RB': 1, 'seems_VBZ': 0, 'world_NN': 0, 'makes_VBZ': 1, "'re_VBP": 0, 'however_RB': 0, 'big_JJ': 0, 'every_DT': 0, 'audience_NN': 0, 'seen_VBN': 0, 'performance_NN': 1, 'going_VBG': 0, 'role_NN': 0, 'gets_VBZ': 0, 'may_MD': 0, 'back_RB': 0, 'real_JJ': 0, 'things_NNS': 0, 'years_NNS': 0, 'end_NN': 0, 

Let's check the results again ...

In [81]:
classifier = train_and_evaluate(train_set, test_set)

Accuracy: 0.785


In [82]:
classifier.show_most_informative_features(5)

Most Informative Features
                great_JJ = 4                 pos : neg    =     16.5 : 1.0
            wonderful_JJ = 2                 pos : neg    =     12.5 : 1.0
               stupid_JJ = 2                 neg : pos    =     10.9 : 1.0
            excellent_JJ = 2                 pos : neg    =      9.1 : 1.0
               boring_JJ = 2                 neg : pos    =      8.9 : 1.0


## Exercise
Compute the performance of TextBlob using the same test set.

<!--
y=[]
y_pred=[]
for fn in docnames[numtrain:]:
    y.append(fn.split('/')[0])
    if TextBlob(mr.raw(fn)).sentiment.polarity >= 0:
        y_pred.append("pos")
    else:
        y_pred.append("neg")

from sklearn import metrics
print("Accuracy: ", metrics.accuracy_score(y, y_pred))
-->