<h1 align="center"> Introduction to Natural Language Processing (NLP) Using Python </h1>
<h3 align="center"> Professor Fernando Vieira da Silva MSc.</h3>

<h2>Classification Problem</h2>

<p>In this tutorial we work through a practical text classification problem. The goal is to label a sentence as "formal" or "informal" writing.</p>

<b>1. Getting the corpus</b>

<p>To keep the problem simple, we will keep using the Gutenberg corpus as the source of formal text, and use chat messages from the <b>nps_chat</b> corpus as informal text.</p>
<p>First, let's download the nps_chat corpus:</p>

In [1]:
import nltk

nltk.download('nps_chat')

[nltk_data] Downloading package nps_chat to
[nltk_data]     /home/fernando/nltk_data...
[nltk_data]   Unzipping corpora/nps_chat.zip.


True

In [2]:
from nltk.corpus import nps_chat

print(nps_chat.fileids())

['10-19-20s_706posts.xml', '10-19-30s_705posts.xml', '10-19-40s_686posts.xml', '10-19-adults_706posts.xml', '10-24-40s_706posts.xml', '10-26-teens_706posts.xml', '11-06-adults_706posts.xml', '11-08-20s_705posts.xml', '11-08-40s_706posts.xml', '11-08-adults_705posts.xml', '11-08-teens_706posts.xml', '11-09-20s_706posts.xml', '11-09-40s_706posts.xml', '11-09-adults_706posts.xml', '11-09-teens_706posts.xml']


<p>Now we read both corpora and store their sentences in a single ndarray. Note that we also build a second ndarray that indicates whether each text is formal (1) or informal (0). We start by storing the corpora in lists. For teaching purposes, we use only 500 elements from each.</p>

In [3]:
import nltk

x_data_nps = []

for fileid in nltk.corpus.nps_chat.fileids():
    x_data_nps.extend([post.text for post in nps_chat.xml_posts(fileid)])

y_data_nps = [0] * len(x_data_nps)

x_data_gut = []
for fileid in nltk.corpus.gutenberg.fileids():
    x_data_gut.extend([' '.join(sent) for sent in nltk.corpus.gutenberg.sents(fileid)])
    
y_data_gut = [1] * len(x_data_gut)

x_data_full = x_data_nps[:500] + x_data_gut[:500]
print(len(x_data_full))
y_data_full = y_data_nps[:500] + y_data_gut[:500]
print(len(y_data_full))

1000
1000
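
<p>As a quick sanity check on how the label list lines up with the texts, the following minimal sketch uses hypothetical toy sentences (not taken from either corpus) and counts the examples per class with <code>collections.Counter</code>. It assumes the same convention as the code above: 0 for informal and 1 for formal.</p>

```python
from collections import Counter

# Hypothetical toy data standing in for the two corpora.
informal_texts = ["hey whats up", "lol brb", "r u there?"]
formal_texts = ["The committee will convene tomorrow morning.",
                "She was most grateful for the invitation."]

# Same labeling convention as the tutorial: 0 = informal, 1 = formal.
texts = informal_texts + formal_texts
labels = [0] * len(informal_texts) + [1] * len(formal_texts)

# Each text must have exactly one label, in the same position.
assert len(texts) == len(labels)

# Count how many examples fall into each class.
class_counts = Counter(labels)
print(class_counts)
```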


<p>Next, we convert these lists into ndarrays, so we can use them in the preprocessing steps we already know.</p>

In [4]:
import numpy as np

x_data = np.array(x_data_full, dtype=object)
#x_data = np.array(x_data_full)
print(x_data.shape)
y_data = np.array(y_data_full)
print(y_data.shape)

(1000,)
(1000,)


<b>2. Splitting into training and test datasets</b>

<p>For the study to be reliable, we need to evaluate the results on a test dataset. So we split the data at random, keeping roughly 80% for training and the rest for testing the results shortly.</p>

In [5]:
train_indexes = np.random.rand(len(x_data)) < 0.80

print(len(train_indexes))
print(train_indexes[:10])

1000
[ True  True  True  True  True  True False False False  True]


In [6]:
x_data_train = x_data[train_indexes]
y_data_train = y_data[train_indexes]

print(len(x_data_train))
print(len(y_data_train))

808
808


In [7]:
x_data_test = x_data[~train_indexes]
y_data_test = y_data[~train_indexes]

print(len(x_data_test))
print(len(y_data_test))

192
192
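
<p>The random mask used above keeps each example for training with probability 0.8, so the split is only approximately 80/20 (808 and 192 examples in the run shown) and changes on every run. As an alternative, the sketch below shuffles the indices with a seeded generator and cuts at exactly 80%. The seed value and the toy arrays are assumptions made for illustration.</p>

```python
import numpy as np

# Toy stand-ins for x_data and y_data (hypothetical values).
x_toy = np.array(["text %d" % i for i in range(10)], dtype=object)
y_toy = np.array([0] * 5 + [1] * 5)

# A seeded generator makes the shuffle reproducible; 42 is an arbitrary choice.
rng = np.random.default_rng(42)
shuffled = rng.permutation(len(x_toy))

# Cut the shuffled indices at exactly 80% of the data.
cut = int(0.8 * len(x_toy))
train_idx, test_idx = shuffled[:cut], shuffled[cut:]

x_toy_train, y_toy_train = x_toy[train_idx], y_toy[train_idx]
x_toy_test, y_toy_test = x_toy[test_idx], y_toy[test_idx]

print(len(x_toy_train), len(x_toy_test))
```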


<b>3. Training the classifier</b>

<p>For tokenization, we use the same function as in the previous tutorial:</p>

In [8]:
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import string
from nltk.corpus import wordnet

stopwords_list = stopwords.words('english')

lemmatizer = WordNetLemmatizer()

def my_tokenizer(doc):
    words = word_tokenize(doc)
    
    pos_tags = pos_tag(words)
    
    non_stopwords = [w for w in pos_tags if not w[0].lower() in stopwords_list]
    
    non_punctuation = [w for w in non_stopwords if not w[0] in string.punctuation]
    
    lemmas = []
    for w in non_punctuation:
        if w[1].startswith('J'):
            pos = wordnet.ADJ
        elif w[1].startswith('V'):
            pos = wordnet.VERB
        elif w[1].startswith('N'):
            pos = wordnet.NOUN
        elif w[1].startswith('R'):
            pos = wordnet.ADV
        else:
            pos = wordnet.NOUN
        
        lemmas.append(lemmatizer.lemmatize(w[0], pos))

    return lemmas
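
<p>The if/elif chain in the tokenizer maps Penn Treebank tags (as produced by <code>pos_tag</code>) onto the part-of-speech codes that the WordNet lemmatizer expects. A minimal standalone sketch of that mapping follows. The helper name <code>penn_to_wordnet</code> is hypothetical, and the single-letter codes are written out on the assumption that they match NLTK's <code>wordnet.ADJ</code>, <code>wordnet.VERB</code>, <code>wordnet.NOUN</code> and <code>wordnet.ADV</code> constants.</p>

```python
# Assumed to equal NLTK's wordnet.ADJ, wordnet.VERB, wordnet.NOUN, wordnet.ADV.
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code (hypothetical helper)."""
    prefix_map = {"J": ADJ, "V": VERB, "N": NOUN, "R": ADV}
    # Fall back to NOUN for any other tag, as the tokenizer does.
    return prefix_map.get(tag[:1], NOUN)

for tag in ["JJ", "VBD", "NNS", "RB", "DT"]:
    print(tag, penn_to_wordnet(tag))
```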
    
    

<p>This time, however, we will build a <b>pipeline</b> containing the TF-IDF vectorizer, SVD for dimensionality reduction, and a classification algorithm. First, let's wrap our algorithm for choosing the number of SVD dimensions in a class that can be used within the pipeline:</p>

In [9]:
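from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import TruncatedSVD

class SVDDimSelect(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # First pass: fit an SVD with half of the original dimensions
        # (n_components must be an integer, hence the floor division).
        self.svd_transformer = TruncatedSVD(n_components=max(1, X.shape[1] // 2))
        self.svd_transformer.fit(X)

        # Count the components whose cumulative explained variance stays below 50%.
        cumulative_variance = 0.0
        k = 0
        for var in sorted(self.svd_transformer.explained_variance_ratio_)[::-1]:
            cumulative_variance += var
            if cumulative_variance >= 0.5:
                break
            else:
                k += 1

        # Second pass: refit with the selected number of components (at least one).
        self.svd_transformer = TruncatedSVD(n_components=max(1, k))
        self.svd_transformer.fit(X)
        return self

    def transform(self, X, y=None):
        return self.svd_transformer.transform(X)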
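
<p>The selection rule can also be illustrated with NumPy alone. The sketch below uses made-up explained-variance ratios (an assumption, not values from this dataset) and finds the smallest number of components whose cumulative ratio reaches the 0.5 threshold. Note that the loop in <code>SVDDimSelect</code> stops counting just before the threshold is crossed, so for the same ratios it would generally keep one component fewer.</p>

```python
import numpy as np

# Hypothetical explained-variance ratios, already sorted in descending order.
ratios = np.array([0.30, 0.15, 0.10, 0.05, 0.02])
threshold = 0.5

# Running total of the variance explained by the first 1, 2, 3, ... components.
cumulative = np.cumsum(ratios)

# Index of the first running total that reaches the threshold, plus one,
# is the smallest number of components explaining at least 50%.
k = int(np.searchsorted(cumulative, threshold)) + 1
print(k)
```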

<p>Finally, we can create our pipeline:</p>

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn import neighbors

clf = neighbors.KNeighborsClassifier(n_neighbors=10, weights='uniform')

my_pipeline = Pipeline([('tfidf', TfidfVectorizer(tokenizer=my_tokenizer)),\
                       ('svd', SVDDimSelect()), \
                       ('clf', clf)])
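
<p>To see how data flows through such a pipeline, here is a small self-contained sketch on hypothetical toy sentences. It swaps in scikit-learn's default tokenizer and a fixed <code>TruncatedSVD(n_components=2)</code> in place of <code>my_tokenizer</code> and <code>SVDDimSelect</code>, so it illustrates the structure only, not the tutorial's exact model.</p>

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 0 = informal, 1 = formal.
toy_texts = [
    "hey whats up lol",
    "brb gotta go lol",
    "u there? hey",
    "The committee will convene tomorrow morning.",
    "She was most grateful for the kind invitation.",
    "The report was submitted to the committee.",
]
toy_labels = [0, 0, 0, 1, 1, 1]

toy_pipeline = Pipeline([
    # Text -> sparse TF-IDF matrix.
    ('tfidf', TfidfVectorizer()),
    # Reduce the TF-IDF matrix to 2 dimensions.
    ('svd', TruncatedSVD(n_components=2, random_state=0)),
    # Classify in the reduced space.
    ('clf', KNeighborsClassifier(n_neighbors=1)),
])

# fit() fits and applies each transformer in order, then fits the classifier.
toy_pipeline.fit(toy_texts, toy_labels)

# predict() applies the same transformations before calling the classifier.
toy_pred = toy_pipeline.predict(["lol hey", "The committee was grateful."])
print(toy_pred)
```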

<p>We're almost there... Now we create a <b>RandomizedSearchCV</b> object, which selects the hyperparameters of our classifier (i.e., parameters that are not learned during training). This step is important for finding the best configuration of the classification algorithm. To save training time, we use a simple algorithm, <i>K nearest neighbors (KNN)</i>.</p>

In [11]:
# RandomizedSearchCV lives in sklearn.model_selection (the old sklearn.grid_search module was removed).
from sklearn.model_selection import RandomizedSearchCV

par = {'clf__n_neighbors': range(1, 60), 'clf__weights': ['uniform', 'distance']}


hyperpar_selector = RandomizedSearchCV(my_pipeline, par, cv=3, scoring='accuracy', n_jobs=2, n_iter=20)
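
<p>Rather than trying every combination, <code>RandomizedSearchCV</code> samples <code>n_iter</code> parameter settings and scores each one with cross-validation. The sketch below shows this on synthetic data generated with <code>make_classification</code>. The dataset, the parameter ranges, and the <code>random_state</code> values are assumptions chosen for illustration, and the import path is the one used by current scikit-learn releases.</p>

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic two-class data standing in for the vectorized sentences.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

toy_pipeline = Pipeline([('clf', KNeighborsClassifier())])

# The "clf__" prefix routes each parameter to the pipeline step named 'clf'.
toy_params = {
    'clf__n_neighbors': list(range(1, 30)),
    'clf__weights': ['uniform', 'distance'],
}

# Sample 5 of the 58 possible combinations and score each with 3-fold CV.
search = RandomizedSearchCV(toy_pipeline, toy_params, n_iter=5, cv=3,
                            scoring='accuracy', random_state=0)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```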




<p>And now we train our algorithm, using the pipeline with feature selection:</p>

In [12]:
#print(hyperpar_selector)

hyperpar_selector.fit(X=x_data_train, y=y_data_train)


RandomizedSearchCV(cv=3, error_score='raise',
          estimator=Pipeline(steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...wski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform'))]),
          fit_params={}, iid=True, n_iter=20, n_jobs=2,
          param_distributions={'clf__n_neighbors': range(1, 60), 'clf__weights': ['uniform', 'distance']},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          scoring='accuracy', verbose=0)

In [13]:
print("Best score: %0.3f" % hyperpar_selector.best_score_)
print("Best parameters set:")
best_parameters = hyperpar_selector.best_estimator_.get_params()
for param_name in sorted(par.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Best score: 0.635
Best parameters set:
	clf__n_neighbors: 3
	clf__weights: 'distance'


<b>4. Testing the classifier</b>

<p>Now let's run the classifier on our test dataset and look at the results:</p>

In [14]:
from sklearn.metrics import accuracy_score

y_pred = hyperpar_selector.predict(x_data_test)

print(accuracy_score(y_data_test, y_pred))

0.734375
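
<p>Accuracy alone does not show which class the errors fall into. As a hedged illustration on hypothetical toy labels (not the tutorial's predictions), the sketch below uses <code>confusion_matrix</code> from scikit-learn to count correct and incorrect predictions per class.</p>

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 0 = informal, 1 = formal.
y_true_toy = [0, 0, 0, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# Fraction of predictions that match the true labels.
acc = accuracy_score(y_true_toy, y_pred_toy)
print(acc)
```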


<b>5. Serializing the model</b><br>

In [15]:
import pickle

string_obj = pickle.dumps(hyperpar_selector)

In [16]:
# The with-block closes the file automatically, even if the write fails.
with open('model.pkl', 'wb') as model_file:
    model_file.write(string_obj)
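
<p>The same round trip can be written with <code>pickle.dump</code> and <code>pickle.load</code>, which write to and read from an open file directly. The sketch below uses a plain dictionary as a hypothetical stand-in for the trained model and a temporary directory so it does not overwrite <code>model.pkl</code>. One caveat worth knowing: pickle stores functions and classes by reference, so objects such as <code>my_tokenizer</code> and <code>SVDDimSelect</code> must be defined or importable in the session that loads the model.</p>

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model object.
toy_model = {"n_neighbors": 3, "weights": "distance"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")

    # Serialize straight to the file; the with-block closes it automatically.
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)

    # Deserialize from the file.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_model)
```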

<b>6. Loading and using a saved model</b><br>

In [17]:

# Read the serialized bytes; the with-block closes the file automatically.
with open('model.pkl', 'rb') as model_file:
    model_content = model_file.read()

obj_classifier = pickle.loads(model_content)

res = obj_classifier.predict(["what's up bro?"])

print(res)

[0]


In [18]:
res = obj_classifier.predict(x_data_test)
print(accuracy_score(y_data_test, res))

0.734375


In [19]:
res = obj_classifier.predict(x_data_test)

print(res)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
 0 0 1 0 1 1 0 1 1 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 1 0 0 1 1 0 0 1 1 0 0 0 0
 0 0 0 0 1 0 0 0 0 1 1 0 0 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1
 0 1 1 1 0 0 0]


In [20]:
formal = [x_data_test[i] for i in range(len(res)) if res[i] == 1]

for txt in formal:
    print("%s\n" % txt)


i already wrote what i wanted you to read.

Sorrow came -- a gentle sorrow -- but not at all in the shape of any disagreeable consciousness .-- Miss Taylor married .

I could not walk half so far ."

He will be able to tell her how we all are ."

Mr . Knightley had a cheerful manner , which always did him good ; and his many inquiries after " poor Isabella " and her children were answered most satisfactorily .

When this was over , Mr . Woodhouse gratefully observed , " It is very kind of you , Mr . Knightley , to come out at this late hour to call upon us .

" Well !

" By the bye -- I have not wished you joy .

" But , Mr . Knightley , she is really very sorry to lose poor Miss Taylor , and I am sure she _will_ miss her more than she thinks for ."

You made a lucky guess ; and _that_ is all that can be said ."

Poor Mr . Elton !

You like Mr . Elton , papa ,-- I must look about for a wife for him .

Captain Weston was a general favourite ; and when the chances of his military life ha

In [22]:
informal = [x_data_test[i] for i in range(len(res)) if res[i] == 0]

for txt in informal:
    print("%s\n" % txt)

10-19-20sUser7 is a gay name.

.ACTION gives 10-19-20sUser121 a golf clap.

:)

don't golf clap me.

fuck you 10-19-20sUser121:@

whats everyone up to?

PART

JOIN

ewwwww lol

r u serious

JOIN

I'll take one, please.

26/m

JOIN

JOIN

Anyone from Tennessee in here?

10-19-20sUser121 is missing a B in her name

and i don't complain about things being hard very often.

JOIN

PART

brb

JOIN

PART

hey any guys with cams wanna play?

hey 10-19-20sUser126

PART

what did you but on e-bay

yeee haw 10-19-20sUser30

wb 10-19-20sUser139

PART

you should make it 'iamahotnip', 10-19-20sUser44

hi 10-19-20sUser139.

ahah "iamahotniplickme"

Hi 10-19-20sUser121

10-19-20sUser136.. get the hell outta my freaking PM box.. Im with my fiance!!!!!!!!!!!!!!!!

I like it when you do it, 10-19-20sUser83

uh huh 

i have one already 10-19-20sUser7... yayayayayyy!!

OOooOO:)

'iamahotnipwithhotnippics

lmao!!!

I just laughed

finger?

10-19-20sUser141... get outta my PM Box.. didnt ya hear!!!!

you ca

In [21]:
res2 = obj_classifier.predict(["Emma spared no exertions to maintain this happier flow of ideas , and hoped , by the help of backgammon , to get her father tolerably through the evening , and be attacked by no regrets but her own"])

print(res2)

[1]
